DOCUMENT RESUME 



ED 066 893 



EM 010 196 



AUTHOR Bergman, Brian A.; Siegel, Arthur I.
TITLE Training Evaluation and Student Achievement Measurement: A Review of the Literature.
INSTITUTION Applied Psychological Services, Inc., Wayne, Pa.
SPONS AGENCY Air Force Human Resources Lab., Lowry AFB, Colo.
REPORT NO AFHRL-TR-72-3
PUB DATE Jan 72
NOTE 67p.



EDRS PRICE MF-$0.65 HC-$3.29 

DESCRIPTORS *Achievement; Branching; Computer Assisted
Instruction; Confidence Testing; Cost Effectiveness;
Criterion Referenced Tests; Curriculum Development;
*Evaluation Methods; Instructional Technology;
Learning Modalities; *Literature Reviews;
*Measurement Techniques; Military Training;
Motivation; Statistical Analysis; Systems Approach;
Testing; *Training Techniques



ABSTRACT 

Training evaluation and student achievement
measurement literature is reviewed with primary emphasis placed on
studies reported in the last 10 years. Recent trends in training 
evaluation and student achievement measurement are presented. Factors 
relating to this topic, such as statistical methods, course 
development methods, training techniques, learning styles, 
motivation, and moderator variables are also included. Where new 
methods of training evaluation and student achievement measurement
appear in the literature, detailed presentations are given. Among 
these procedures were cost-effectiveness or cost-benefit analysis, 
criterion-referenced testing, sequential testing, confidence testing, 
convergent and discriminant validity, and computer-assisted branched 
testing. Conclusions are that systematic approaches to evaluation and 
course development are receiving more and more attention. Most
systems begin with a job analysis in order to derive a list of 
behaviorally-oriented job requirements from which training objectives 
can be formulated. The new techniques in evaluation and measurement 
have resulted from attempts to determine whether training objectives 
have been realized. (Author/JK) 
ABSTRACT

Training evaluation and student measurement literature is reviewed.
Studies which have impacted heavily on recent trends are also included. Because of the
obvious interaction between both training evaluation and student measurement, on the
one hand, and such topics as statistical methods, methods for course development,
training techniques, learning styles, motivation, and moderator variables, on the other hand,
these and similar considerations are also included.



SUMMARY 



Bergman, B.A., & Siegel, A.I. Training evaluation and student achievement measurement: A review of the 
literature. AFHRL-TR-72-3. Lowry AFB, Colo.: Technical Training Division, Air Force Human 
Resources Laboratory, January 1972. 

Problem 

The purpose of this paper is to review the training evaluation and student achievement measurement 
literature with primary emphasis being placed on studies reported in the last ten years. 

Approach 

Recent trends in training evaluation and student achievement measurement are presented. Because of 
the obvious interaction between both training evaluation and student measurement, on the one hand, and 
such topics as statistical methods, course development methods, training techniques, learning styles, 
motivation, and moderator variables, on the other hand, these and similar considerations are also included. 

Results 

Where new methods of training evaluation and student achievement measurement appeared in the 
literature, detailed presentations were given. Among these procedures were cost-effectiveness or cost-benefit
analysis, criterion-referenced testing, sequential testing, confidence testing, convergent and discriminant 
validity, and computer assisted branched testing. 

Conclusions 

Systematic approaches to evaluation and course development are receiving more and more attention. 
Most systems begin with a job analysis in order to derive a list of behaviorally oriented job requirements 
from which training objectives can be formulated. The new techniques in evaluation and measurement have 
resulted from attempts to determine whether training objectives have been realized.

This summary was prepared by Wayne S. Sellman, Technical Training Division, Air Force Human 
Resources Laboratory. 
TRAINING EVALUATION AND STUDENT ACHIEVEMENT MEASUREMENT:
A REVIEW OF THE LITERATURE



I. INTRODUCTION 

Methods and procedures for evaluating training 
courses and student achievement have been slowly 
evolving and assuming increased stature within any 
training program developmental paradigm which 
aims to be at all complete. This increased emphasis 
on training evaluation and student measurement is 
due, in part, to the increased realization that there 
can be no training system without quality control. 
Training in this sense is viewed as a process 
(analogous to a chemical or manufacturing 
process) in which raw material (students) is 
converted from one form to another (skilled
craftsmen). Within such a construct, there must be 
a quality control stage; training evaluation and 
student measurement represent the quality control 
stage in the training process. 

This report selectively reviews the current
literature related to training evaluation and
student achievement measurement. The review 
period extends over the 20 years preceding 1970, 
although the emphasis is not evenly apportioned 
throughout the entire span. The first ten years of 
the period are only briefly covered. Advances of 
the last decade indicate that, except for historical
perspective, the 1950 to 1960 time frame should
be treated rather lightly in a review such as this. 
Air Force flight equipment of the Korean War and 
immediate post-Korean War era is today looked 
upon as vintage equipment. Ten years ago, the
digital computer, systems thinking, and 
programmed instruction were in their virtual 
infancy; and computer assisted training, T-group 
training, and behavior modification were all things 
of the future. Accordingly, the first decade of the 
review period has received only modest emphasis. 

The heavier emphasis in this review is the recent 
ten years, with the last five being most thoroughly 
covered.* The goal was to examine the subject 
matter areas but, most importantly, to determine 
for future reference, the answers to the questions 
“what is new in training evaluation?” and “what is 
new in student achievement measurement?” With 
these principal goals, placement of heaviest 
emphasis on the most contemporary time period 
seems clearly indicated. 



Sources Searched

In order to identify relevant literature, the 
following sources were searched: Psychological 
Abstracts, Technical Abstract Bulletins of the 
Defense Documentation Center, and the U. S.
Government Research and Development Reports, 
published by the Department of Commerce. 

The Psychological Abstracts were reviewed 
from Number 1 of the 1966 volume through
Number 4 of the 1971 volume, thus affording 
entry to the literature of the 1965-1970 period. 
The topics covered were Education and Training in 
the General section; Testing in the Methodology 
and Research Technology section; Testing, 
Counseling and Guidance, Teachers and Teacher 
Training, School Learning and Achievement in the 
Educational Psychology section; and Vocational 
Choice and Guidance, Selection and Placement,
and Training, in the Personnel and Industrial 
Psychology section. 

The Technical Abstract Bulletins were reviewed 
from Number 1 of the 1966 index volume to 
Number 24 of the 1970 volume. The topics 
searched in these index volumes were Evaluation, 
Performance, Personnel, and Testing. 

The U. S. Government Research and Development
Reports reviewed were from issue Number 1
of 1968 to Number 12 of 1971. The major subject 
field searched was Behavioral and Social Sciences; 
the specific subfields examined were Human 
Factors Engineering, Man-Machine Relations, 
Personnel Selection, Training and Evaluation, and 
Psychology (Individual and Group Behavior). 

In addition to these systematic searches of 
source listings, the act of reading in the literature 
unearthed other literature of relevance. Partic- 
ularly valuable in suggesting articles and books of 
importance were issues of the Psychological 
Bulletin and appropriate chapters of the Annual 
Review of Psychology. Thus, as a result of the
systematic examination of three listing sources, 
the utilization of other review and discussion 
articles which integrated much of the thinking in 
the subject fields, and the normal reading of the 
published materials of these fields, a degree of 
confidence can be manifested in the compre- 
hensiveness of the coverage of this review. 
Training Evaluation and Student
Achievement Measurement

Training evaluation and student achievement 
measurement in some ways involve similar con- 
structs, and in some ways they involve different 
constructs. Moreover, several different meanings 
have been attached to the term “training evalua- 
tion.” 

There are at least three major and quite 
different reasons for measuring student achieve- 
ment. The most time-honored of these is for
determining whether the student has mastered the 
prescribed subject matter and, hence, can be 
promoted, graduated, certified, licensed, or in 
some other way acknowledged. This type of 
student measurement takes place for purposes of
evaluating the student; and it is completely
distinct from evaluating the training provided to
the student, or from other reasons for student
measurement. 

A second reason for student measurement is to 
determine his subject matter areas of strength and 
weakness for reinforcement and feedback purposes 
and for diagnosis and subsequent remedial action. 
Many automated, or programmed, instructional
texts and devices provide for this type of measure- 
ment, as do most good tutors. This student 
measurement is an instructional technique, and it 
is completely distinct from evaluating either the 
student or the training.

Finally, student measurement is employed for 
purposes of drawing inferences about the effective-
ness of the instruction provided to the student.
Other things being equal, it can be inferred that 
the more the students have achieved, the better 
the quality of the instruction. Student achieve- 
ment in this case is, indeed, a method of training 
evaluation. In only one, then, of the three uses of 
student measurement does student measurement 
overlap the topic of training evaluation. In the 
other two uses, student measurement is a distinct 
topic of interest without any necessary reference 
to training evaluation. 

The term training evaluation also has multiple 
meanings and has been applied in a number of
different contexts. At a minimum, one should 
distinguish comparative or relative training evalua- 
tions from more absolute evaluations of training. 
The first case involves the determination of which 
is best among a number of methods or programs 
for presenting the training content. The second 
case involves determination of how good the 
training is. 

In addition to the obvious syllogistic point that 
a particular program may be the best and yet not
be very good, the relative or absolute distinction
has other implications for this review. The time 
frame covered has seen exceedingly rapid accelera- 
tion in the rate of development of new 
instructional methods. From Pressey and Skinner’s
early teaching machines, to a number of different 
approaches to programmed texts, to computer 
assisted instruction, the “traditional” classroom 
has probably undergone more of a metamorphosis 
in this relatively brief time period than in all of its 
preceding years. And, with each new development, 
a multitude of evaluations comparing it either to 
traditional methods or to the last new develop- 
ment have appeared. The result has been a 
literature very full of comparative training 
evaluations. No attempt has been made to discuss 
more than a sample of these comparative evalua- 
tions. To do more would overbalance the review 
with, in many cases, rather trivial studies. 

The major thrust of this review is on systems, 
quantitative methods, and evaluations of training 
which have utilized more absolute criteria. Such
studies have maximum import for the quality 
control stage within an instructional system. This 
quality control stage in an Air Force context is 
concerned with how well students are prepared for
job performance, not whether the Air Force’s 
method is better or worse than someone else's. 



II. DIMENSIONS OF EVALUATION 

Roles, Uses, and Characteristics 
of Evaluation 

Stake (1969) and his associates (Stake & 
Denny, 1969) differentiate between evaluation 
and scientific research, while admitting that both
can overlap. Stake indicates that evaluation studies 
are concerned with worth or value while research 
studies are rarely concerned with these issues. 
Stake also defines what is meant by “high” and 
“low” forms of evaluation. In high forms of 
evaluation, the results are generalizable across 
schools, situations, and students. In the low form 
of evaluation, the findings are restricted to the 
specific research situation because the experi- 
mental conditions are not samples of the universe 
of conditions. This delineation of the high and low 
forms of evaluation is analogous to the random 
and fixed-effects models referred to in statistical 
(analysis of variance) contexts. Nonetheless, many 
persons engaged in student measurement and 
training evaluation research have used fixed-effects 
designs and then erroneously generalized to other 
programs of instruction. 
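
For readers less familiar with the analysis of variance terminology, the distinction can be written out for a one-way layout. The notation below is a conventional textbook sketch and is not taken from Stake; the symbols (program index i, student index j) are assumptions introduced here for illustration.

```latex
% Score of student j in training program i
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)

% Fixed-effects model ("low" form): the programs studied are the only ones of interest
\sum_{i} \alpha_i = 0, \qquad \alpha_i \text{ are unknown constants}

% Random-effects model ("high" form): the programs are a sample from a universe of programs
\alpha_i \sim N(0, \sigma_\alpha^2)
```

Under the fixed-effects model, conclusions apply only to the particular programs observed. Inference about programs in general requires the random-effects assumption that the programs studied were sampled from the universe of interest.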
Flanagan (1969) and Bloom (1969) define what
is meant by the terms “formative” and
“summative” evaluation. Formative evaluation is a
process concerned with the development of an
educational program. Summative evaluation,
though, is primarily concerned with evaluation at
the end of a program. Stake (1969) feels that this 
distinction between summative and formative 
evaluation is trivial since formative evaluation 
never ends for the instructors and program 
developers. A program is summative only for 
someone who is outside the program and looking 
in for a statement of its effects. 

Thelen (1969) feels that the role of evaluative
measurement is “. . . feedback, diagnosis, and
steering . . .” of the student. Merwin (1969),
taking a broader view, thinks that there are three
roles for evaluation: (a) school planning and
administration which includes pupil classification,
diagnosis of learning disabilities, appraisal of pupil
progress, identification of special aptitudes, pupil
promotion, and effectiveness of teaching; (b) in- 
struction, its diagnosis and effectiveness; and (c) 
student decision making or helping the students to 
plan and evaluate their own educational experi- 
ences. Similarly, Cronbach (1963) lists course 
improvement, decisions about individuals, and
administrative regulation as the purposes of 
evaluation. 

Wittrock (1970) defines evaluation as involving
decisions and judgments about instruction as
causes of learning. It is noted that such judgments
of causal relations are difficult, inasmuch as differ-
ential psychology has studied individual
differences to the exclusion of cause and effect
relations among learners, educational environ-
ments, and learning. The evaluation of instruction,
according to Wittrock, should include observation
of the student’s environment (e.g., teacher
characteristics, student background), evaluation of 
the learners via achievement testing, and evalua- 
tion of learning or of permanent behavior changes. 
Denova (1968), using a similar paradigm, says that 
evaluation has three components: assessing 
changes in employee (student) behavior; observing 
whether training helps achieve organizational 
goals; and evaluating the training programs, tech- 
niques, and personnel. 

G. Johnson (1970) lists three characteristics of 
evaluation: establishing merit, applications, and 
multidimensionality. Johnson’s dimensions of 
evaluation are objectives, processes, components, 
end-products, environmental context, secondary 
or unplanned effects, and costs. 



Angell, Shearer, and Berliner (1964) list four
uses for evaluation data: (a) early detection and
correction of behavior; (b) continual modification
of instructional procedures when appropriate; (c)
knowledge of whether desired achievement levels
have been attained; and (d) acquisition of learning
curves. 

According to Gagne (1970), evaluation has two 
meanings. The first meaning of evaluation involves 
the determination of the worth of a system or 
program, and the second meaning involves deter- 
mining if learning has occurred. These uses appear 
to be directly analogous to the topic of this litera- 
ture review. Provus (1969), emphasizing training 
functions, thinks that the purpose of evaluation is 
to determine whether to improve, keep, or end a 
program. Evaluation is agreement with program 
standards, determining if a discrepancy exists in 
some aspect of the program, and using this infor- 
mation to delineate the weak points of the system. 

Wiley (1970) compares and contrasts the con- 
cept of evaluation with the concepts of appraisal 
and assessment. According to Wiley, assessment 
and appraisal involve the process of “. . . judging
what is valuable and ascertaining the particular 
levels of valued traits (p. 260).” Evaluation, 
though, is concerned only with the latter, and it 
must be empirical and behavioral. Appraisal, there- 
fore, involves a designative and an evaluative 
function. Continuing, Wiley says that “. . . evalua-
tion consists of the collection and use of infor-
mation concerning changes in pupil behavior in
order to make decisions about an educational
program (p. 261).” 

Jaeger (1970) feels that evaluative techniques 
can be applied to institutional decision making and 
educational management. Evaluation can be 
helpful in allocation of resources in terms of 
educational need, in modification of school pro- 
grams, and in promotion of public understanding
of the meaning of test scores. 

Crawford (1969) and Berdie (1969) both have 
rather contrasting views of evaluation usage. 
Crawford feels that the goals of evaluation are 
increased efficiency, decreased time, and decreased 
costs. Berdie, though, feels that the uses of evalua- 
tion are educational, vocational, and individual. 

Perhaps the best statement of the use of evalua- 
tion is given by Hemphill (1969). He says that the 
worth of an evaluation study is based “. . . on its
contribution to a rational decision process in
which it is necessary to estimate the probability of
a desirable but uncertain outcome of an action
chosen from a number of alternative actions (p.
219).” In this sense, evaluation is an aid to the
decision making processes. 

Thus, educational evaluation has meant a
number of different things to different people.
The literature indicates it to be multidimensional
in purposes, and these purposes seem to vary
across the goals of the evaluators. Few have
separated measurement (the act of deriving data)
from evaluation (the judgments made on the basis
of the data). Such a taxonomy might represent at
least an initial step toward providing a unifying 
conceptual scheme. In this sense, educational 
evaluation is a process which is used to make 
decisions with regard to instructional programs, 
instructors, students, institutional planning, 
administration, and costs. Measurement represents 
a set of techniques which are applied to derive the 
data on which the evaluation is based. 

Specification of Objectives 

Many writers (e.g., Bloom, 1969; Flanagan,
1969; Glaser, 1967, 1970; Glaser & Glanzer, 1958; 
La\dnsky, 1969; Peck & Dingham, 1968; Waina, 
1969; Whitmore, 1970a, 1970b, 1970c, 1970d)
have stressed the need for a carefully specified set 
of objectives as a precursor to training and evalua- 
tion. While this seems self-evident, early specific- 
ation of objectives often seems to be ignored. Most 
of the sources indicate that objectives should be 
defined in terms of skills and behaviors. An 
essential step, then, prior to the specification of 
objectives is a behavioral job analysis from which 
the basic job requirements can be derived. This
process should result in a training program 
composed of small, discrete units with each unit 
having its own objective. Wittrock (1970) and
Cronbach (1963) add that the specification of
behavioral objectives allows absolute rather than 
relative student measurement. This enables one to 
determine who has and who has not achieved the 
objectives rather than who scores best or worst. 
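
The contrast between absolute and relative measurement can be made concrete with a short sketch. The example below is illustrative only and is not drawn from Wittrock or Cronbach: the student labels, the scores, and the 0.8 mastery cutoff are all assumed values chosen for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    student: str
    score: float  # proportion of objective-related items performed correctly


def criterion_referenced(results, cutoff=0.8):
    """Absolute measurement: has each student achieved the objective?

    The cutoff is a hypothetical mastery standard, not a value from the report.
    """
    return {r.student: r.score >= cutoff for r in results}


def norm_referenced(results):
    """Relative measurement: rank students against one another (1 = best)."""
    ordered = sorted(results, key=lambda r: r.score, reverse=True)
    return {r.student: rank for rank, r in enumerate(ordered, start=1)}


# Hypothetical scores for three students on one behavioral objective.
results = [Result("A", 0.95), Result("B", 0.85), Result("C", 0.60)]
mastery = criterion_referenced(results)
ranks = norm_referenced(results)
```

Here the ranking says only that student B stands second, while the criterion-referenced result says that B has met the objective and C has not, which is the information a training decision needs.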

Bloom (1969) suggests that there should be 
consideration of the intangible outcomes of
instruction. The intangible outcomes may be
desirable (e.g., stimulation of extra reading) or
undesirable (e.g., dislike of subject matter), which
can lead to a revision or change in the educational 
objectives. These outcomes, however, seem quite 
amorphous and subject to considerable measure- 
ment error. 

At a still higher level of abstraction, Carpenter
and Rapp (1969) would determine the objectives
of training by removing any objective which is
dependent upon another (a concept which is theo-
retically neat but impractical); eliminating any
objective that will not be affected by the choice of
alternatives (a rather nonempirically defined con-
cept); and finding an abstract objective to which
all of the alternative objectives are means (which
leaves the weighting of the alternative objectives
open). 

Thus, the determination and specification of 
objectives can assume a number of levels. These
range from “objectively” derived statements of
required skills and knowledge through motiva-
tional constructs and finally through complete
abstraction. 

Systematic Approaches to Course 
Development 

Approaches to course development have also
ranged from broad based molar systems through 
more discrete and molecular methods. 

Carss (1969) advocates the use of a flow chart 
model of the educational system components in 
order to derive a course. This model should 
contain the flow of behaviors or acts needed to
complete training. In the operation of the educa- 
tional system, the relevant variables are identified 
and quantified and converted into formulae to 
determine the effect of output (e.g., student
behavior) when different inputs are considered. 
This is a simulation technique because one does 
not need to intervene in the school. In addition,
Carpenter and Rapp (1969) add the obvious point 
that when different systems are being compared, 
all of their aspects which could affect output 
should be the same except for those being studied. 

In an earlier paper, Glaser and Glanzer (1958)
listed four requirements for course development: 

1. Specification of objectives: A list of the
objectives of the course in behavioral terms.

2. Input control: The selection of enrollees
into the training program (e.g., number of
men available, testing costs, etc.).

3. Techniques and methods of training:
Decisions regarding the amount of practice,
learning guidance, reinforcement, extinction,
training sequence, meaningful relationships
in learning, use of punishment, learning
plateaus, motivation, individual differences,
etc.

4. Output control: Measurement of training
(e.g., formative evaluation, setting of profi-
ciency standards, diagnosis of training in-
adequacies, performance tests, etc.).
Osborn (1970) presents an interesting model
which he calls a “closed loop” approach. However,
as early as 1950, workers in the area regarded
training evaluation as feeding back to the instruc-
tional process. Thus, the closed loop concept
would not be regarded as a “new” development. 
Osborn indicates that job requirements lead to 
training objectives which result in training content 
and performance tests which ultimately yield an 
evaluation of the quality of student performance 
in terms of job requirements. Osborn feels that it 
is often too costly to develop a full field perform- 
ance test for a large number of individuals. He 
suggests a matrix approach as the solution to this 
dilemma. First, the job components (behaviors) 
are listed across the top of the page. Down the left 
side of the page is a list of the potential test 
methods graded in degree of complexity from full
field to paper-and-pencil (e.g., simulations, photos,
pictures, drawings). Osborn contends that many 
times it is necessary to compromise: to sacrifice
relevance and diagnostic capability for economy. 
The alternatives must be considered, and then the 
most complex, yet feasible method, must be 
selected and used. 
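
The selection rule in Osborn's matrix can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than Osborn's own procedure: the ordering of the intermediate methods, the job components, and the feasibility judgments are hypothetical, and `select_methods` is a name invented for this example.

```python
# Test methods ordered from most complex (full field) to least complex.
# The ordering of the intermediate methods is an assumption for illustration.
METHODS = ["full field", "simulation", "photos", "drawings", "paper-and-pencil"]


def select_methods(feasible):
    """For each job component, pick the most complex method that is feasible.

    `feasible` maps a job component to the set of test methods judged
    affordable and practical for it (hypothetical analyst judgments).
    Returns None for a component with no feasible method.
    """
    chosen = {}
    for component, allowed in feasible.items():
        chosen[component] = next((m for m in METHODS if m in allowed), None)
    return chosen


# Hypothetical matrix: job components against the methods judged feasible.
feasible = {
    "inspect connector": {"photos", "paper-and-pencil"},
    "replace module": {"simulation", "drawings", "paper-and-pencil"},
    "taxi aircraft": set(),
}
plan = select_methods(feasible)
```

A component with no feasible method is returned as `None`, which flags that the compromise between relevance and economy still has to be resolved for that behavior.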

The sequence of course development used in 
the Army’s Trainfire I program (Crawford, 1969) 
includes (a) job analysis; (b) transfer of the job 
description into a test of how well the man 
performs the necessary skills; (c) development of 
new training stressing realism, clarity, and 
simplicity; and (d) experimentation using a con- 
ventionally trained group and an experimentally 
trained group which are compared on the test. 

Glaser (Glaser, 1970a, 1970b; Glaser & Cox, 
1968) presents a somewhat more elaborate model 
than his earlier version (Glaser & Glanzer, 1958). 
This new model includes the following: 

1. Specification of objectives in terms of
observable behavior. Criterion-referenced 
measures indicate the content of the 
subject’s behavior in regard to the objectives 
and without regard to the performance of 
others. 

2. Diagnosis and profiling of the subject enter- 
ing instruction. The types of entering 
behavior that need measurement are 
previous extent of achievement in the 
subject area, prerequisites, learning set 
variables, ability to make discriminations, 
and general intelligence. 

3. Selection of “instructional alternatives” 
based on the diagnostic and profiling step of
the system. 



4. Continuous assessment and monitoring
which can include frequency of correct 
answers, errors in relation to a standard, 
speed, transfer and generalization, attention 
span, and response latency. 

5. Adaption and optimization. The treatments
and individual differences may interact;
therefore, individuals should be adapted to 
the best treatment. Those that interact most 
with the treatment are the most important. 
Decisions about treatments should be made 
sequentially, and these should be optimized 
by using quantitative methods. 

6. Evolution or self-contained improvement 
capability that modifies itself after 
acquisition of new knowledge. 

A system which mirrors much of the prior 
thinking is the Instructional System Development 
(ISD) technique developed by the United States 
Air Force (Air Force Manual 50-2, 1970). This 
system in its latest form contains the following 
steps: 

1. Analyze system requirements

2. Define education or training requirements 

3. Develop objectives and tests 

4. Plan, develop, and validate instruction

5. Conduct and evaluate instruction 

Hunter, Lyons, MacCaslin, Smith, and Wagner 
(1969) feel that training program content must be 
job relevant. Taking the seven-step Human 
Resources Research Organization method of 
curriculum development and applying it to what 
the services are doing, they reported several 
findings: (a) system analysis for training purposes
was not used in any of the services; (b) there was a 
requirement for task inventories in the Army and 
Air Force; (c) there was no development of a job 
model for any service; (d) there was no task 
analysis for curriculum development; (e) all serv- 
ices said training objectives should be job relevant 
but no provision was made for specificity; (f)
training program development procedures were 
not maximally effective because the objectives 
were not fully specified; (g) there was very little or no
evaluation and assessment of training effects (the
Air Force had the only standards of graduate
behavior and was the only service to perform field
visits); and (h) training accounted for 6 percent of
the defense budget. 

In summation, the systematic approaches to 
course development attempt to account for almost 
all of the variables that can affect training and
student behavior. Most of the systems begin with
job analysis in order to derive a set of behavioral 
job requirements from which training objectives 
can be formulated. Many writers advocate a pre- 
training assessment of the entering students in 
order to channel them to the training program
which is most suited to their needs and abilities. 
Performance tests and other measures of student 
behavior are then constructed in order to reflect 
the training objectives. Finally, after training the
students, training programs are evaluated 
through various means. 

Measures and Methods of Evaluation 

Campbell (1971) presents a rather dim picture 
of the current state of methodology in training 
and evaluation literature. He feels “. . . by and
large, the training and development literature is
voluminous, nonempirical, nontheoretical, poorly
written, and dull (p. 565).” Continuing, Campbell
says that “. . . In sum, the methodology of train-
ing and development research cries for in-
novation. . . . As yet we have no workable
technology that is capable of producing a large
amount of training research data (p. 579).”

Similarly, Schultz and Siegel (1961a, 1961b) as 
the result of a comprehensive review, observed 
earlier a need for a unifying conceptual structure 
with more emphasis on theoretical development in 
the area of job performance rather than technical 
advancements. They argued for more research 
based on an integrative theoretical framework 
rather than on an inductive framework. 

Campbell, Dunnette, Lawler, and Weick (1970)
divide training criteria into two groups. Internal 
criteria are those directly concerned with the train- 
ing itself, while external criteria measure post- 
training or on-the-job behavior. These authors 
recommend the use of multiple criteria, each 
reflecting different aspects of the organization’s 
goals. Gagne (1970) presents a similar dichotomy 
in which he stresses initial problems directly 
connected with the lesson and transfer problems 
involving principles taught in the lesson.

Use of a composite overall criterion will un- 
doubtedly obfuscate important relationships since 
many of the subcriteria within the composite are 
probably orthogonal (Cronbach, 1963). According 
to Dunnette (1963), it is preferable to have 
multiple criteria in order to account for a greater 
proportion of the behavior variance. 
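
The way a composite criterion can obscure a relationship is easy to demonstrate numerically. The sketch below uses invented data, not figures from Cronbach or Dunnette: the four trainees, the two subcriteria (constructed to be uncorrelated), and the weight of 2 in the composite are all assumptions made for the illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for four trainees.
training_score = [1, 2, 3, 4]   # end-of-course measure
job_skill = [1, 2, 3, 4]        # subcriterion that tracks the training measure
job_attitude = [1, -1, -1, 1]   # subcriterion orthogonal to job_skill

# Composite criterion with an arbitrary (assumed) weight of 2 on attitude.
composite = [s + 2 * a for s, a in zip(job_skill, job_attitude)]

r_skill = pearson(training_score, job_skill)
r_attitude = pearson(training_score, job_attitude)
r_composite = pearson(training_score, composite)
```

With these invented values the training measure correlates perfectly with the skill subcriterion and not at all with the attitude subcriterion, yet its correlation with the composite is only about 0.49. Reporting the subcriteria separately keeps both facts visible.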

The evaluation or measurement must not be 
affected by the method of measurement or 



research procedure. Even the presence of the 
experimenter or the process of evaluation itself 
can alter the results (Bloom, 1969; Cronbach, 
1963). According to Gagne (1970) two evaluation 
criteria for measures are “distinctiveness” and
“freedom from distortion.”

Weiss and Rein (1970) claim that broad based 
evaluation programs have design and technical 
problems so ponderous as to make any evaluation 
impractical and questionable. They propose a 
developmentally oriented, more qualitative evalua- 
tion as being more appropriate. Weiss and Rein 
imply that where there are many variables to 
consider, one cannot possibly prove or disprove
the values of any program. 

Biel (1962) says that ". . . fundamental 
criteria for evaluating a simulation-based training 
program or device is the extent of transfer of train- 
ing to the live situation. ... In cases. . . where 
ultimate criteria are obviously unavailable, inter- 
mediate criteria must be employed. One example 
of an intermediate criterion is performance in a final 
examination. . . Sometimes improvement as 
measured by performance on the training device 
itself is the best measure available of the effective- 
ness of the device and its associated training 
program (pp. 377-378).” Gagne (1968) has given a 
similar emphasis to transfer of training. 

Crawford (1962) and Glaser and Klaus (1962) 
posit that proficiency tests developed from job 
analysis should be employed to evaluate students 
and training. The standards on the proficiency test 
must be based on acceptable or adequate job 
behavior. 

Cronbach (1963) feels that, in training evalua- 
tion and student measurement, the testing of 
terminology which is specific to the training 
course should be kept independent from tests of 
understanding of content. A person who is not 
taking the course should be able to understand
(not necessarily answer) the question. Cronbach 
also classifies transfer of learning into an 
immediate and a long-term category. Immediate 
transfer involves testing the student’s course 
knowledge, while long-term transfer is concerned 
with aptitude gain and learning to learn. 

Angell, Shearer, and Berliner (1964) list three 
types of training measures:

1. Initial measures given prior to instruction or 
training and which are used for selection 
purposes. The correlation between the 
selection tests and future performance 
should be high. 

2. Interim measures taken while training is in
progress, and “. . . they are more accurately
predictive of terminal proficiency than are
measures made earlier (p. 3).”

3. Terminal measures obtained after training is
completed and which are predicted by the
initial and interim measures. Some examples
of terminal measures are written tests, oral
tests, performance tests, expert judgments,
and rating scales.

Peck and Dingman (1968) present a unique 
method of evaluating student teachers. Training is 
attained when each of the training objectives is 
reached by the student teacher, and these advances 
yield significant pupil gains in the classroom. 

Della-Piana and Berger (1970) have provided a 
design for conducting pilot studies on the 
efficiency of programmed instruction. They begin 
with six to eight subjects of above average ability 
who can give verbal feedback which is relevant to
program revision. The subjects are split into groups 
of three or four each. The groups are presented 
with the programmed instruction, and, on 
completion of the training, they are queried
regarding possible revisions for the program. 

Thelen (1969) describes diagnosis (progression 
toward goals) and troubleshooting (difference 
between what exists and what ought to be) in the 
context of group instruction. In group instruction, 
the students are unsupervised most of the class 
time, and the instructor can only hope to sample
their behavior. In a highly structured class, the 
evaluation is in an authoritarian framework in 
which student and teacher behavior are evaluated 
on several continua from good to poor. This can 
be considered evaluation of deviancy. In the un- 
structured class, no set of criteria for describing 
deviant behavior can exist. All behavior is thought 
to be relevant, and attempts are made to account 
for it, or to understand why it occurred. The
authoritarian teacher knows what is to be taught 
and determines the extent to which individuals 
differ in meeting expectations. The more demo- 
cratic instructor will use games, ungraded classes, 
small work groups, and student cohesiveness.
Finally, Thelen advocates the use of “barometric” 
individuals, or students who respond consistently 
and selectively to instruction or to some other 
important group condition.

Wiley (1970) advocates a system of evaluation 
which could lead to a great savings in time. First, if 
all the students in the class receive the same 
experimental treatment, then the appropriate 
statistical datum is the class, not the student. 



When the datum is a collective, one can sample
from it and save considerable time. In addition,
one does not have to give each student all the
items. Even single items can be used, and they are
easier to interpret than total scores. Jaeger (1970)
uses the aforementioned sampling strategy for
institutional decision making.
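
The item-sampling idea can be illustrated with a minimal sketch. It assumes dichotomously scored items and treats the class, not the student, as the unit of analysis; the function name and the data layout are hypothetical and are not taken from Wiley or Jaeger.

```python
import random

def class_mean_by_item_sampling(responses, items_per_student, seed=0):
    """Estimate a class's mean proportion correct when each student
    answers only a random subset of the item pool.

    responses[s][i] is 1 if student s answers item i correctly, else 0
    (hypothetical layout). The estimate describes the class as a whole.
    """
    rng = random.Random(seed)
    n_items = len(responses[0])
    total = 0
    count = 0
    for student in responses:
        # Each student sees only a random sample of the item pool.
        sampled = rng.sample(range(n_items), items_per_student)
        total += sum(student[i] for i in sampled)
        count += items_per_student
    return total / count
```

When every student receives every item, the estimate equals the full class mean; with fewer items per student it approximates that mean at a lower testing cost.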

Wiley also introduces some new terminology in 
his descriptive system of evaluation. First, the 
standards of evaluation involve designating traits 
to evaluate and designating the levels that are
thought to be appropriate. Secondly, the object of 
evaluation is the instructional program and its 
component parts. Next, the vehicles of evaluation 
are directly affected by the objects, and they 
consist of students, classes, or schools. Finally, the 
instruments of evaluation display the behavior of 
the vehicles. Wiley says that the fundamental 
problem in evaluation “. . . . is to establish the 
effects of the objects on the vehicles by means of 
the instruments (p. 262).” 

Furno (1966) has an evaluation approach
confined to educational surveys. The sequential
elements in Furno’s system are (a) specification of
survey objectives; (b) definition of the population;
(c) description of what information is to be
collected; (d) determination of the best mode of
measurement; (e) selection of the sampling unit;
(f) selection of the sample; (g) planning of field
work so that it will be carried out smoothly; (h)
conduct of a pilot study; (i) provision for data
processing; (j) analysis of data; and (k) storing of
survey information and providing for access when
needed.

Somewhat less elaborate are Hawkridge’s
(1970) seven phases of evaluation research: (a)
specification of objectives; (b) selection of
objectives to be measured; (c) selection of
instruments and methods; (d) sample selection; (e)
measurement and observation schedule
development; (f) choosing analytic techniques; and (g)
drawing conclusions and making
recommendations.

Campbell (1970) suggests a completely selective 
approach including the use of an evaluation model 
which measures trainee reactions, trainee learning,
trainee behavior on the job, and results with regard 
to the organization. Campbell concludes that too 
many evaluation studies have focused on the 
measurement of trainee reaction (e.g., attitudes
and opinions), to the exclusion of the other
dependent measures. 

Flanagan’s (1969) system of evaluation includes 
(a) defining the outputs of the system including 
the objectives and unplanned effects; (b) selecting
the procedures needed to measure the worth of
the outcomes (e.g., costs, benefits); and (c)
composing a plan based on analysis including a
decision and overall evaluation of the final
program.

Possibly, an evaluation which aims to be at all
complete should include consideration of most, if
not all, of Scriven’s (1967) criteria. They include
(a) knowledge of specific items of information and
patterns and sequences of information items; (b)
comprehension of internal relationships within the
field (inferences and implications), interfield
relationships or the association between the
knowledge of one field and that of another, and
application of the field or its principles to an
appropriate example; and (c) motivation and
attitude toward the course, the subject, the field,
field relevant materials, learning and knowledge
activities in general, school, career teaching, the
teacher, peers, and self.

Problems of Evaluation 

As was mentioned previously, Campbell (1970) 
thinks that too many evaluation studies use
measurement of trainee reactions to the exclusion 
of trainee learning, trainee behavior on the job, 
and effects on the organization. Trow (1970) feels 
that much innovation in training is done for its 
own sake to relieve boredom and only secondarily
for its outcomes. Evaluation studies are too often 
large-scale and aimed at funding agencies to prove 
that the innovation is of value. 

C. Harris (1970) points out that most investi- 
gators fail to integrate prior research into their 
experimental designs. He goes one step further by 
posing the question of integrating prior research
findings into numerical research analysis. Harris’ 
concept would be feasible if more collaboration 
could be achieved among different agencies and 
investigators. A related problem (Lortie, 1970) is 
whether or not ultimately too much centralized 
evaluation will be achieved (without realizing it) 
through the use of computers and data processing 
equipment. Clearly, an optimum middle ground 
must be found. 

Student measurement can have both positive 
and negative effects. The person being evaluated 
will always respond to evaluation in terms of the 
perceived fairness. If he perceives the evaluation as 
unfair, the person being evaluated may become 
resentful, especially if the evaluation is more 
critical to his career or to his student status 
(Bloom, 1970). 



Evaluation cannot function in an authoritarian 
society which resists social change. Evaluation also 
does not function well in an equalitarian society 
because all persons in it are considered equal. In 
actuality, evaluation functions best in a com- 
petitive society (Berdie, 1969). One must also 
consider the various publics at which the evaluation
is aimed. These publics are trainees, trainers,
sponsoring organizations, training technicians, and
social scientists. The value of a particular type of
training must be presented to the public with
which it is concerned, and it may be different for
each public (Bass, Thiagarajan, & Ryterband, 1968).

Walker (1965) performed a study illustrating
one of the most serious problems in evaluation
research. He asked 20 training experts to rate 16 
training techniques with regard to 34 training 
selection criteria. These training personnel tended 
to select training methods based on administrative 
and contractual needs to the exclusion of training 
methods based on educational and psychological 
principles. Walker concluded that this group of
training experts was more concerned with budget 
and training time than with learning. 

Berdie (1969) lists conceptual needs and 
problems of evaluation and measurement. He 
identifies the requirement to evaluate whole 
persons and the various ways in which traits 
cluster together; and, further, the need to know 
more about statistical as opposed to clinical 
prediction. Breadth of evaluation in addition to 
depth of evaluation must be considered; and 
various statistical modes of prediction must be 
attempted (e.g., moderator variables).

Smode, Hall, and Meyer (1966) severely
criticize Air Force evaluation research. They
contend that (a) different dependent measures are
often used across studies leading to incomparability
of results; (b) too much stress is placed upon
subjective opinions (e.g., rating); (c) different
limits or standards are used for describing
performance; (d) too many personnel and equipment
changes occur during the execution of many
studies resulting in a lack of proper research
control; (e) different methods of processing and
interpreting the transfer of training data are
employed; (f) presentation of the same study in
different reports makes it difficult to determine
exactly what was done; (g) inadequate and
imprecise criteria are used; (h) comparability and
control of skill levels of subjects and trainees are
lacking; (i) there is difficulty in matching research
criteria and tasks to flight conditions and
demands; and (j) there is disorganization and lack
of cooperation among researchers.



In a somewhat different context, Suchman 
(1967) presents a systematic overview of the short- 
comings of evaluation research in general. First, 
with regard to objectives, Suchman feels that 
certain excesses have tended to characterize the 
research: too much arbitrary problem selection; 
too much stress on resources and material and not 
enough on achievement; too much stress on 
quantity of services and record keeping at the 
expense of true evaluation; too much emphasis on 
program objectives based upon tradition and
common sense; too much mixing of final, intermediate,
and immediate objectives; and too much
idealism and not enough realism. 

In listing inadequacies regarding procedural 
methods, Suchman criticizes the excessive 
emphasis on research based on available or existing 
records which discourages the gathering of new 
data; the absence of sound experimental designs, 
thus making it difficult to determine if change is 
the result of innovation or chance; the use of 
measurements of unknown consistency and 
accuracy; the use of weighting methods and 
standards too often based upon rational rather 
than empirical means; the inadequate allowance 
for or control of demographic variables (e.g.,
locale, race, age) making interpretation difficult; 
and the over-emphasis on correlation with in- 
adequate attention to causality. 

Suchman also comments on the administration 
of evaluation studies, contending that evaluation 
guides are too often used by unsophisticated 
persons, thus making analysis and comparison of 
ratings difficult. Further, he suggests that self- 
evaluations are too often used, which allows bias 
to contaminate data. And, finally, when super- 
visors are forced to perform evaluations in 
addition to their usual activities, it becomes 
difficult to properly plan, organize, and conduct 
evaluation studies. 

What generalization can be extracted from this 
mass of critical rhetoric? First, these writers seem 
to think that there has been too much use of 
rational (armchair) rather than empirical methods. 
Similarly, they feel that evaluation research is too 
often subjective when objectivity is needed. 
Finally, evaluation research is too often limited by 
monetary considerations. The monetary criticism 
is probably the most important, since most of the 
other criticisms can be reduced to it. What most 
investigators do not realize is that cost cutting 
actually wastes money because the results of the 
research are at best uninterpretable. Many 
agencies, contractors, and others doing research 
might be well advised to save their money and do 
perhaps one or two sound research studies rather 
than five or six poor ones. 



Summary 

In the first section of this chapter, the roles, 
uses, and characteristics of evaluation were dis- 
cussed. Evaluation was differentiated from 
research. Formative and summative types of 
evaluation were discussed. Also, evaluation was 
contrasted with appraisal and assessment. It was 
concluded that evaluation is a process which is 
used to make decisions with regard to instructional 
programs, instructors, students, institutional 
planning, administration, and costs. 

The second part of this chapter contained a 
short discussion of objectives. Most of the sources 
reviewed seemed to indicate that each unit of 
training must have a behavioral objective based on
the job requirements. 

The third portion of this chapter contained a 
systematic overview of approaches to evaluation 
and course development. These systems 
approaches to evaluation and course development 
attempt to account for almost all of the variables 
that can affect training and student behavior. 

The fourth segment of the chapter consisted of
a discussion of the measurement aspects of evalua- 
tion. There was a presentation of the various types 
of criteria that can be used in evaluation studies. 
Emphasis was placed on the multidimensional 
aspects of criterion measurement. Most of the 
writers reviewed suggested that transfer of learning
was the ultimate goal of training. Also, sampling 
procedures were suggested as a means of saving 
time and costs when the units of measurement are 
whole classes and schools. 

The final section of this chapter presented a
discussion of the various problems and difficulties 
involved in evaluation studies. Several conclusions 
were drawn: 

1. There is too much use of rational rather than
empirical methods. 

2. There is too much subjectivity when 
objectivity is needed. 

3. Evaluation research is too often limited by 
monetary considerations. 



III. QUANTITATIVE METHODS AND 
DEPENDENT MEASURES 

Characteristics of Dependent Variables 

Fitzpatrick (1970) lists four characteristics of 
criteria which he thinks are essential for any 
evaluative measure. First, the criteria must be 
relevant to the objectives being measured. Second, 
the criteria must be comprehensive and cover all 



important objectives. Third, the criteria must be
reliable within the limits of cost. Finally, the 
criteria selected must be feasible, and this is deter- 
mined almost solely by cost. 

Bloom (1970) also makes a set of very relevant 
comments concerning validity with regard to 
student measurement and training evaluation. 
Generally, content validity is stressed in training 
evaluation, while construct validity is emphasized 
in assessment and appraisal. Student measurement, 
though, usually emphasizes predictive and concurrent
validity. Bloom feels that the type necessary
should be determined and not be confined to one
or another. Bond and Rigney (1970) add that the
dependent measure which “best predicts final
performance” should always be selected.

Several indices may be related to final perform- 
ance, and the computer can be used to choose and 
weight them. 

Gideonse (1968) lists several types of measures 
that can be used for measuring students and for 
training evaluation. Gideonse’s measures are (a) 
student achievement as measured by tests (which 
leaves many of the student’s intellectual qualities
untapped); (b) a desirable change after a stimulus
input; (c) dropout or attrition rate; (d) attitudinal
and motivational measures; (e) education levels;
and (f) facilities, equipment, materials, human
resources, pupil expenditure, non-school activities, 
organization patterns, and administrative agencies. 

Campbell and Dunnette (1968) add that most 
T-group research involves the use of attitude scales 
or opinion change as criteria rather than organiza- 
tional performance or improvement. 

Crawford (1967) indicates that proficiency 
tests, when used to evaluate training programs, 
should not just be used at the end of training, but 
should also be used to test retention after a period 
of disuse. Similarly, Martin (1957) divides criteria 
into those based on the content of the training 
program (internal criteria) and those based upon 
job behavior (external criteria). 

Englemann (1968) contends that there are two 
kinds of conditions which can indicate that learn- 
ing has occurred. In the fixed condition, a 
response or instance of behavior is used to show 
that learning has taken place. This is the criterion 
of performance. In the variable condition, several 
responses can show that learning has occurred. 
One can easily see that within this latter condition, 
it is easier for the student to demonstrate that he
understands the concept being taught since the 
requirement for learning in the variable condition 
is dependent on a concept or rule and not on a 



response. Englemann adds that both the fixed and 
variable conditions are needed depending upon the
situation. 

Kelley and Kelley (1970) document a unique 
type of dependent measure for research which 
holds the traditional dependent variables of speed 
and accuracy constant. They work with an
“adaptive variable” which is the adjustment the
student must make to obtain a certain score with 
speed and accuracy held constant. The adjustment 
is the dependent variable, and it can be any 
variable which affects performance. 

Test Construction 

Denova (1968) lists the steps in test construction
as follows: (a) defining test scope, (b)
defining what is measured, (c) choosing items, (d)
choosing the most appropriate testing technique,
(e) determining the number of items, (f) choosing
final items, (g) arranging items, (h) writing clearly
understandable directions, (i) constructing a
scoring template, and (j) evaluating questions.
Evaluation of the test, of course, involves such
factors as (a) validity, (b) reliability, (c) simplicity,
(d) distribution, (e) content, (f) objectivity, and
(g) difficulty level. Other, more exhaustive,
accounts of test construction and its concomitant
problems can be found in many sources such as
Air Force Manual 50-9 (1967), Gronlund (1968),
and Wood (1960). The remaining parts of this
chapter, therefore, are devoted to some new
techniques and applications.

Horn (1966) feels that a predictor test must 
have internal consistency in order for it to corre- 
late adequately with a criterion. On the other 
hand, he feels that assessment tests need represent- 
ativeness of content regardless of internal con- 
sistency. He demonstrated that his own classroom 
assessment devices were more like predictors than 
assessors. Horn concludes that there is no reason 
why assessment devices must have low internal 
consistency reliability. 

McGuire and Babbott (1967) constructed a test 
for medical students consisting of a series of 
simulation exercises. The test begins with a case 
write-up and several possible courses of action or 
diagnoses. Each choice the student makes is 
branched to other choice points until the patient is 
either dead, transferred, or gets well. In the con- 
struction of the test, a panel of experts rated each 
choice along a five-point scale which ranged from 
“clearly indicated” to “clearly contra-indicated.”
Several possible scores result from the procedure. 
The efficiency score is the percentage of the 
student’s answers which are helpful to the patient. 
The proficiency score is the percentage agreement
with the criterion group (optimal patient care).
Proficiency, then, is a combination of errors of
commission and errors of omission. The composite
score is a function of proficiency and efficiency.
According to McGuire and Babbott, traditional
multiple-choice tests take a portion of behavior
and treat it independently of the total behavior
pattern of which it is a part. This stresses
“product” as opposed to “process.” McGuire and
Babbott conclude that their test stresses the process
aspects of behavior and that it is uncorrelated
with most multiple-choice tests.
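
One possible reading of the two scores is sketched below. The set-based representation, the function names, and the treatment of "agreement" as option-by-option agreement with the criterion group are assumptions made for illustration. The composite score is omitted because the review does not state how it is formed.

```python
def efficiency_score(chosen, helpful):
    """Percentage of the student's selected options that experts rated as
    helpful to the patient (assumed reading of the efficiency score)."""
    chosen = set(chosen)
    if not chosen:
        return 0.0
    return 100.0 * len(chosen & set(helpful)) / len(chosen)

def proficiency_score(chosen, optimal, all_options):
    """Percentage agreement with the criterion group over all options.

    Selecting an option outside the optimal set is an error of commission;
    skipping an option inside it is an error of omission. Both lower the score.
    """
    chosen, optimal = set(chosen), set(optimal)
    agree = sum((opt in chosen) == (opt in optimal) for opt in all_options)
    return 100.0 * agree / len(all_options)
```

For a student who selects options A, B, and E when the criterion group selects A, B, and C out of six options, this sketch counts one omission (C) and one commission (E), giving agreement on four of the six options.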

Westbrook and Jones (1968) used a class of 
psychology graduate students to construct a
multiple-choice test of Anastasi’s testing book. 
There were 54 items in form A and 54 items in 
form B. The Kuder-Richardson reliability was .73 
and the split-half reliability was .62. The tests were 
validated against a teacher-made test, resulting in
validities of .75 for form A and .59 for form B. 
Evidently, graduate students can be used to con- 
struct fairly reliable and valid tests. 
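
For readers unfamiliar with the two reliability indices mentioned, a minimal sketch follows. It assumes the Kuder-Richardson figure refers to formula 20 and that items are scored 0 or 1; the review does not say which formula was used or whether the split-half value was stepped up with the Spearman-Brown correction, so both functions are illustrative only.

```python
from statistics import pvariance

def kr20(scores):
    """Kuder-Richardson formula 20 for dichotomous items.

    scores[s][i] is 1 or 0 for student s on item i. Total scores must
    vary across students, otherwise the variance in the denominator is zero.
    """
    k = len(scores[0])
    n = len(scores)
    totals = [sum(row) for row in scores]
    var_total = pvariance(totals)
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n  # proportion passing item i
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)
```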

Gorth and Grayson (1969) developed a Fortran 
computer program which can “. . .compose and 
print any number of tests consisting of questions, 
multiple-choice or completion type, selected from 
an item pool (p. 173).” This program will make as 
many copies as is desired, randomize multiple- 
choice answers, and print scoring keys. Appar- 
ently, this program is for sale. 
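
The general idea of such a program can be sketched in a few lines. The sketch below is written in Python rather than Fortran and is not a reconstruction of Gorth and Grayson's program; the item format, the function name, and the limit of eight options per item are assumptions.

```python
import random

def compose_test(item_pool, n_items, seed):
    """Draw n_items from the pool, shuffle each item's options, and
    return the printed form together with its scoring key.

    Each pool entry is assumed to be a dict with "stem", "options",
    and "answer" (the text of the correct option).
    """
    rng = random.Random(seed)
    chosen = rng.sample(item_pool, n_items)
    test, key = [], []
    for item in chosen:
        options = list(item["options"])
        rng.shuffle(options)  # randomize answer positions for this form
        test.append({"stem": item["stem"], "options": options})
        key.append("ABCDEFGH"[options.index(item["answer"])])
    return test, key
```

Calling the function with different seeds yields different forms drawn from the same pool, each with its own scoring key.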

Forrest (1970) wished to develop an objective 
flight test for private pilot certification. His test 
consists of a miniature sample of flying situations 
typically met by pilots. Each situation involves an 
evaluation and an action. The test measures (a) 
retention and recall, (b) judgment, (c) planning 
and problem solving, (d) perceptual-motor
coordination, and (e) habit. The actual test was a
cross-country flight with a pre-flight and an
in-flight phase (N = 15). Scores on the test correlated
.50 with expert ratings. 

Hierarchical and Sequential Testing 

Hierarchical and sequential tests involve a 
sequence of branching in which the student only 
gets items at his own level. This procedure 
decreases testing time, increases reliability, and 
increases student motivation because he is not 
forced to take and guess at the more difficult 
items. The concept was introduced in early 
“intelligence tests” and has recently received new 
emphasis. An example of the application is the 
work of Cleary, Linn, and Rock (1968a, 1968b) 
who wished to use programmed tests to decrease



testing time while leaving reliability and validity 
the same. In the procedure described by Cleary, 
Linn, and Rock, each student receives a different 
set of items along a scale. Sequentially programmed
tests have a routing section which
branches the subject to the appropriate items and
a measurement section containing items of suitable
difficulty. The routing section can be used alone,
although these investigators used a combination.
These authors used the test scores of 4,885 11th 
grade students on the School and College Ability 
Tests (SCAT) and Sequential Tests of Educational
Progress (STEP). The sample was divided in half, 
with the second half used for cross-validation 
purposes. The subjects in the initial validation 
effort were routed into four groups using four 
different sequential sampling procedures. One of 
the four routing methods, the sequential method, 
produced the fewest errors of classification and 
the highest overall correlation with the total SCAT
and STEP test scores. The sequential method uses 
fewer items for those easy to classify and more 
items for those at the borderline of categories. The 
measurement test is constructed by obtaining the 
items with the 20 highest within-group point- 
biserial correlations (excluding the routing items).
Computer based testing could facilitate this 
procedure because of speed, flexibility, con- 
venience, and immediacy of feedback. This 
method is especially suited to persons at the 
extremes of the distribution because they can be 
quickly routed and thus save time. One problem 
acknowledged by the authors, with this research 
effort, is that the SCAT and STEP items were 
taken out of context from a total test. This could 
have biased the results. 
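
The item-selection step for the measurement section can be illustrated as follows. This is a sketch only: it assumes that each item is correlated with a total score within one routed group, and the function names and data layout are hypothetical, not those of Cleary, Linn, and Rock.

```python
from statistics import mean, pstdev

def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and a total score."""
    ones = [t for x, t in zip(item, totals) if x == 1]
    zeros = [t for x, t in zip(item, totals) if x == 0]
    sd = pstdev(totals)
    if not ones or not zeros or sd == 0:
        return 0.0  # correlation is undefined; treat the item as uninformative
    p = len(ones) / len(item)
    return (mean(ones) - mean(zeros)) / sd * (p * (1 - p)) ** 0.5

def top_items(item_matrix, totals, n):
    """Indices of the n items with the highest point-biserial correlations.

    item_matrix[i] holds the responses of all examinees in one routed
    group to item i.
    """
    ranked = sorted(
        range(len(item_matrix)),
        key=lambda i: point_biserial(item_matrix[i], totals),
        reverse=True,
    )
    return ranked[:n]
```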

Lord (1971a, 1971b) introduces a theoretical 
treatment of “tailored testing” which is a 
sequential testing procedure consisting of one 
rather than two stages. It is tailored in the sense 
that the items are those that are best suited to the 
individual being tested. “In tailored testing we try 
to choose items for administration that are at a 
difficulty level that matches the examinee's 
ability, which we infer from his responses to the 
items already administered. . . . when the 
examinee gives a wrong answer to an item, the 
next item administered should be an easier one; 
when he gives a correct answer the next item 
administered should be harder (Lord, 1971a, 
pp.34).” In his earlier work, Lord (1969) evolved
a two-stage testing procedure using similar 
principles. 
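
The rule quoted above (an easier item after a wrong answer, a harder item after a correct one) can be sketched as a simple up-and-down procedure. The fixed step of one difficulty level, the starting point at the middle level, and the function names are assumptions; Lord's theoretical treatment is considerably more general.

```python
def next_level(level, correct, n_levels):
    """Move one difficulty level up after a correct answer and one down
    after a wrong answer, staying within the available levels."""
    step = 1 if correct else -1
    return min(max(level + step, 0), n_levels - 1)

def administer(respond, n_levels, n_items):
    """Administer n_items, starting at the middle difficulty level.

    respond(level) is a hypothetical callable returning True if the
    examinee answers an item of that difficulty correctly.
    """
    level = n_levels // 2
    history = []
    for _ in range(n_items):
        correct = respond(level)
        history.append((level, correct))
        level = next_level(level, correct, n_levels)
    return history
```

For an examinee who passes every item up to some level and fails every item above it, the administered levels oscillate around that boundary, which is where the items are most informative about his ability.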

Ferguson (1969) used a computer to select 
items on the basis of a student's prior responses. 
The computer will keep testing the student until 
he satisfies the criterion specified by the training
objective. When the criterion is met, the computer
will route the subject to the next training objective
containing items based upon the student’s
proficiency on the first training objective. The program
was successfully used with 75 elementary school 
students from the Pittsburgh area. 

According to Gagne (1967), if the curriculum
units are arranged hierarchically, and the test items
meet standard requirements, a hierarchical testing
procedure will be implicit since most people who
fail the lower unit will not pass the next higher
unit. Moreover, if persons who pass a lower unit
fail on the next higher unit, an additional
interspersed unit may be indicated. Obviously, this
technique can also indicate whether or not some
units have been reversed in the hierarchy of
instruction.
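
A tabulation of the kind implied by this argument is sketched below. The function name and the labels are hypothetical, and how large a count must be before it signals a problem is left to the investigator.

```python
from collections import Counter

def hierarchy_table(lower_pass, higher_pass):
    """Cross-tabulate pass/fail results on a lower and the next higher unit.

    lower_pass[s] and higher_pass[s] are booleans for student s.
    Many "pass_lower_only" cases may suggest a missing intermediate unit;
    many "pass_higher_only" cases may suggest the units are out of order.
    """
    table = Counter(zip(lower_pass, higher_pass))
    return {
        "pass_both": table[(True, True)],
        "pass_lower_only": table[(True, False)],
        "pass_higher_only": table[(False, True)],
        "fail_both": table[(False, False)],
    }
```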

Criterion- and Norm-Referenced Testing 

Glaser (1963) and his colleagues (Glaser & Cox,
1968; Glaser & Klaus, 1962; Glaser & Nitko,
1971), as well as Popham (1969), Carver (1970),
and Holtzman (1971), have all written on the
topic of criterion-referenced versus norm-referenced
testing. The characteristics of criterion-referenced
tests are that they (a) indicate the
degree of competence attained by an individual
independent of the performance of others; (b)
measure student performance with regard to
specified absolute standards of performance; (c)
minimize individual differences; and (d) consider
variability irrelevant.

Generally, from these statements, it can be seen 
that criterion-referenced tests tell how the student 
is performing with regard to a specified standard 
of behavior. Individual differences are considered 
irrelevant, since the student is graded against a 
single standard rather than against all the others 
taking the test. Assigning grades of competence to 
students on the basis of relative performance, 
when it is not really known whether any of the 
students have attained a specified behavioral 
objective, makes very little sense. One can, though,
derive individual differences from criterion- 
referenced tests by specifying the degree of 
competence reached by each student. 

Simon (1969) thinks that there is no real 
difference between criterion- and norm-referenced 
tests. Whether a test is one or the other depends 
upon how the scores are used. 
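
Simon's point can be made concrete by interpreting one set of scores in both ways. The sketch below is illustrative only; the cutoff and the definition of percentile rank as the percentage of the group scoring lower are assumptions.

```python
def criterion_referenced(scores, cutoff):
    """True where a score meets the absolute standard, regardless of
    how the other students performed."""
    return [s >= cutoff for s in scores]

def percentile_ranks(scores):
    """Norm-referenced reading: percentage of the group scoring below
    each student."""
    n = len(scores)
    return [100.0 * sum(other < s for other in scores) / n for s in scores]
```

With scores of 40, 55, 70, and 85 and a cutoff of 80, only one student meets the standard, although the norm-referenced reading still ranks all four students relative to one another.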

Glaser (1963) and Glaser and Cox (1968) 
discuss the use of norm-referenced achievement 



tests and criterion-referenced tests in differentia- 
ting among individuals and treatment groups. 
When evaluating individuals, one needs to use an 
achievement test containing items with different
difficulty levels. For evaluating treatments or 
experimental conditions, though, one needs 
perfect post-treatment answers and incorrect pre- 
treatment answers so that the dependent measure 
is maximally sensitive to training change. In this 
latter case, criterion-referenced tests are most 
appropriate. 
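
One simple way to express an item's sensitivity to training is the difference between the proportions passing it after and before treatment. This index is offered as an illustration; it is not attributed to Glaser or to Glaser and Cox, and the function name is hypothetical.

```python
def sensitivity_index(pre, post):
    """Proportion correct after training minus proportion correct before.

    pre and post are lists of 0/1 responses to the same item. An item
    that everyone fails before training and everyone passes afterward
    scores 1.0; an item unaffected by training scores 0.0.
    """
    return sum(post) / len(post) - sum(pre) / len(pre)
```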

K. Johnson (1969a, 1969b) suggests that training
evaluation should use criterion-referenced
tests, but that they are costly and just not feasible
for many training situations. Johnson’s purpose
was to determine the degree to which other measures
(e.g., norm-referenced tests, student and instructor 
attitudes) can be used as substitutes for criterion- 
referenced tests. Reliabilities were calculated for 
three measures on four courses taught at the Naval 
Air Technical Training Center. In one course there 
was a comparison with criterion-referenced tests. 
The reliabilities for all three methods were fairly 
high, but a large number of items was needed (i.e.,
more than 20) to get an adequate reliability for 
norm-referenced tests. Student and instructor 
attitudes were highly correlated, but neither had a 
high correlation with norm-referenced tests. Each 
of the three measures accounted for 27 to 43 
percent of the variance of scores on criterion- 
referenced tests. Without defining what he con- 
sidered to be an adequate substitute, Johnson 
concluded that none of the other methods is an
adequate substitute for criterion-referenced tests. 

Siegel, Schultz, and Lanterman (1964) and 
Siegel and Fischl (1965) sought to develop a 
criterion-referenced evaluation scheme for the 
Navy electronics technician rating. What is unique 
and interesting about these studies is that the 
criterion referencing was done in combination 
with Guttman scaling procedures. Their technique 
involved (a) assembling statements of the specific 
system objectives of Naval air electronics; (b)
weighting these objectives on the basis of the 
importance of their respective contributions to 
system requirements; and (c) psychophysically 
establishing cut points on a Guttman-type job 
performance scale, the cut points representing 
levels of skill required in order for each of the 
objectives to be met. The resultant Technical 
Proficiency Checkout Form Scales (TPCF) were 
found to correlate between .65 and .74 with 
performance test scores.

Ratings

Rating, although widely used, is one of the
most unreliable, biased, and contaminated 
methods for evaluating performance. Several 
factors which can contribute to poor or in- 
adequate ratings are (a) friendship, (b) quick 
guessing, (c) jumping to conclusions, (d) first- 
impression responses, (e) appearance, (f)
prejudices, (g) halo effects, (h) errors of central
tendency, and (i) leniency. Of these, the last three
are probably the most important. Halo exists when
a rater allows his overall, general impression of a
man to influence his judgment of each separate 
trait on the rating scale. Errors of leniency occur 
when a rater tends to use only the upper portion 
of the rating scale when rating all or most of his 
men. Errors of central tendency occur when the 
rater uses only the middle portion of the rating 
scale when rating his men. Considerable evidence 
exists which demonstrates that rater training can 
reduce these sources of bias so that the resultant 
ratings are at least minimally useful (Bergman &
Kujawski, 1969). 

Howard and Correll (1966) wanted to deter- 
mine if there was a consensus with regard to the 
acceptability of various behaviors of psychological
interns among those responsible for training them. 
The trainers were given a list of 27 critical incident 
statements and were asked to indicate whether the 
behavior described in the incident was charac- 
teristic of a beginning trainee, an intermediate 
trainee, or a senior trainee. In many instances, 
university based trainers used more lenient 
standards, and in other instances agency trainers 
used more lenient standards. There was, of course, 
some agreement across universities and agencies. 
Overall, some behaviors thought to be charac- 
teristic of beginners in one place were thought to 
be characteristic of senior trainees in another 
place. The authors concluded that more 
uniformity is needed because of the widely differ- 
ing standards of behavior. 

In another study, Edwards (1968) had the 
teachers from five nursing schools rate 55 of their
senior nursing school students on their performance
under three conditions: (a) situations requiring
interpersonal physical care; (b) situations needing
technical skills; and (c) conditions requiring
non-physical, interpersonal patient care.

Evaluations were made by the operating room 
instructor, the medical nursing instructor, and the 
psychiatric instructor. All trainees were rated from 
A to E. The results showed that all interrater 
correlations were very low (.5 at most). The only
fairly high correlations were within instructors
across specialties. The author indicates that these
unreliable results were caused by (a) teacher personality,
(b) relations with students, (c) differential
behavior of students, and (d) differential
teacher criteria. The ratings also had a disappointing
relationship with test scores and grades within
specialty. The ratings correlated -.01 to .27 with 
test scores and .20 to .49 with grades. 

Greer, Smith, and Hatfield (1967) constructed 
a standard system of checkpilot helicopter evaluation
in order to overcome effects of the checkpilots’
proclivity to rate on the basis of their own
personal standards rather than on student flying 
skill. First, the training program was evaluated in 
terms of maneuver components. Specific profi- 
ciency scales and instrument observation were 
used as criteria instead of the checkpilot’s own
schema. From this early work the Pilot Perform- 
ance Description Record (PPDR) was constructed. 
The PPDR consisted of items reflecting the most 
critical aspects of each maneuver. Fifty inter- 
mediate and 50 advanced helicopter students were 
each given checkrides with one research staff 
member and one checkpilot. Prior to this, some of 
the checkpilots were trained in the use of the 
PPDR to reduce checkpilot differences in scoring 
standards. The results showed that (a) the relia- 
bility of flight proficiency evaluations improved; 
(b) the PPDR recorded specific student deficiencies;
(c) checkpilots trained in use of the PPDR
were more consistent in their evaluations than
checkpilots who were only oriented in the PPDR;
and (d) checkpilot training is necessary when using
the PPDR. 

In another study, Greer (1968) wished to 
increase the reliability of checkpilot ratings which 
typically averaged .20. Checkpilots were asked to
complete an 11-item rating form. Those who 
agreed with an r of .90 or better were paired 
together with students; the resultant correlation 
was .65. 

Duffy (1968), Duffy and Jolley (1968), and
Duffy and Anderson (1968) wished to develop an 
objective recording device to score student 
helicopter checkrides. The students were scored 
during and after training and on maneuvers. All 
data were recorded on IBM cards, and a class 
percentage error and a school average were 
tabulated. If certain types of errors tended to 
show up under one instructor in one aspect of 
training, the instructor was given additional instructional
training. If one checkpilot was found
to be more strict than the others, he was also given
counsel to make his ratings less strict.

Caro (1968) undertook a study to compare
grades given by checkpilots and grades given by
instructors before and after innovations in rating 
were introduced. A second study was performed 
to determine if grades were influenced by the 
checkpilot's relationship with the students or the 
instructors. To eliminate bias due to prior knowl- 
edge, 40 of 60 subjects were given checkrides by 
checkpilots outside the classes studied. The
principal results of concern from these two studies 
suggested that (a) there were high correlations 
between instructors and checkpilots from the same
classes; (b) there was no relationship between instructors
and checkpilots from outside the classes;
(c) student grades were affected by the individual 
standards of the checkpilot; (d) specific infor- 
mation was collected by the checkpilot on the 
student’s flight, but not systematically or con- 
sistently; and (e) there were no differences after 
the new grading procedures were introduced. 

Jenkins, Ewart, and Carroll (1950) sought to 
develop an index of combat effectiveness against 
which tests could be validated. They used the 
nomination technique which asks each man to 
name two with whom he would like to fly wing 
and two with whom he would not like to fly wing, 
together with the reasons for his choices (checked 
off on a 22-item checklist). Data were collected on 
2,2 7< high and 1,829 low and 228 mixed pilots. 
The results showed that the nominations were
related to the rank of the officer and that their 
reliability was .80. The reasons for the nomina- 
tions were more reliable for the lows than for the 
highs. Also, there was a different frequency of use 
of reasons for different ranks (e.g., senior officers 
more often avoided going on combat missions than 
junior officers). A factor analysis of the checklist 
data delineated several underlying factors: (a) 
sociability, (b) practical intelligence, (c) cool- 
headedness, (d) combat aggressiveness, (e) flying 
skill, (f) teamwork, (g) leadership (highs only), and
(h) reaction to failure (lows only). A second order 
factor analysis resulted in two high factors 
(fighting ability and capacity for combat leader- 
ship), and three low factors (emotional 
inadequacy, fear-impulsive foolish, and lack of 
practical intelligence). All of the aforementioned 
factors were orthogonal. Those interesting results 
notwithstanding, the ratings failed to predict 
combat success, even with rank controlled. 

In another study, Yellen (1969) used co-worker 
or peer ratings as criteria of performance for field 
artillery crewmen. The multiple correlation 
between these ratings and a weighting of the major 
areas of a proficiency test was .71. 



In one final study (Flaugher, Campbell, & Pike, 
1969), white and black medical technicians were 
rated on job performance by both white and black
supervisors. White supervisors tended to rate the
whites slightly higher than the blacks, while black
supervisors rated blacks considerably higher than
whites. 

In summation, ratings tend to improve to the 
extent that the influence of the rater’s own idiosyncrasies
is prevented from affecting his
observation of subordinate behavior. The evaluator 
must observe and record behavior in objective 
terms. If this suggestion seems mechanistic and 
devoid of rater influence, it is meant to be that 
way. The more the rater can become like a 
behavioral metering device, the less likely he will 
contaminate the evaluation. Also, it will help 
immensely if the rating items are couched in 
behavioral rather than in relative or evaluative 
(e.g., above average) terms. Finally, performance
evaluations should not be tied to salary review 
unless they are to be used for that purpose. 

In general, ratings are much used and conven- 
ient although they are at best a haphazard method 
of evaluating training performance, student 
achievement, or job behavior. If other, more 
objective methods are feasible, they should be 
used. 

Cost Effectiveness

Alkin (1970) has written an extensive treatise 
on cost-benefit analysis. Some of his comments 
and suggestions are reviewed in the ensuing para- 
graphs. 

Generally, cost-benefit analysis is the analysis of 
the costs and benefits of various alternative 
courses of action. The decision maker selects the 
method giving the largest yield at a given cost, or
the most benefit for the least cost. Input and 
output must be measured in dollar terms. Cost- 
benefit studies are usually large-scale. For instance, 
the costs of college education can be compared 
with the resultant increase in productivity yielded 
by the college education. 
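
The selection rule just described can be sketched in a few lines of Python. The program names and dollar figures are purely hypothetical and are not taken from Alkin (1970); the sketch only illustrates choosing the alternative with the largest benefit at a given cost and, separately, the alternative with the largest benefit per dollar.

```python
# Hypothetical alternatives: (name, cost in dollars, benefit in dollars).
# None of these figures come from the literature reviewed here.
alternatives = [
    ("lecture course", 10_000, 14_000),
    ("programmed text", 10_000, 18_000),
    ("simulator course", 25_000, 40_000),
]

def best_at_cost(options, budget):
    """Largest benefit among the options whose cost does not exceed the budget."""
    affordable = [o for o in options if o[1] <= budget]
    return max(affordable, key=lambda o: o[2]) if affordable else None

def best_ratio(options):
    """Largest benefit per dollar of cost."""
    return max(options, key=lambda o: o[2] / o[1])

choice_fixed = best_at_cost(alternatives, budget=10_000)
choice_ratio = best_ratio(alternatives)
```

The two rules need not agree in general; which one applies depends on whether the budget is fixed in advance.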

The manipulatable characteristics are the con- 
ditions whose variations maximize or minimize 
student output. The manipulatable characteristics 
which affect student output are (a) student inputs 
measuring the achievement starting point of the 
student; (b) financial inputs or the funds allocated;
(c) external system which is the giver of inputs and 
the receiver of outputs (e.g., society); and (d) instruction,
supplies, tests, and similar items.

With regard to the outcomes of cost-benefit
analysis, the analyst’s interest is in how the 
student has changed in short- and long-term ways 
(e.g., how well he deals with other schoolwork and
his society). Although there are financial inputs,
there are no financial outcomes except those
derived from behavior changes. There are also 
non-student outcomes which comprise items such 
as teacher salaries and number of personnel used in 
the program. 

Alkin sees three major problems in evaluating 
the cost effectiveness of manipulatable variables. 
They include (a) difficulty in getting accurate cost 
data; (b) difficulty “. . . in dealing with cost-
effectiveness estimates in the light of system- 
interrelationships (p. 235);” and (c) problems in 
generalizing to specific individual cases. 

Hawkridge (1970) says that there are two 
evaluation loops regarding money allocated for 
educational programs. These two loops are the 
“philanthropic” and the “conservative.” As soon 
as money is allocated, many programs spring up. If 
the evaluation is done poorly or unreliably, then 
the money is cut back and the first thing the 
program administrator usually does is cut evalua- 
tion cost so he can keep other aspects of the 
program. One can, of course, stay in the philan- 
thropic loop if sound evaluation is performed. 

Gubins (1970) performed a cost-benefit anal- 
ysis of training programs for the hard core 
unemployed. In this case, cost-benefit analysis is 
based on the cost of unemployment and the gain 
from investment in these human resources. 
Gubins’ findings suggested the impact of increasing 
the number of hard core unemployed in government
training programs: (a) Programs were still
“economically efficient.” (b) There were greater 
gains by trainees with less than nine years’ educa- 
tion over trainees with greater than nine years’ 
education; therefore, the basic education portion 
of training is of most value, (c) Training was more 
beneficial for those less than 22 years of age than 
for those greater than 22 years of age. (d) Trainees 
gained financially after undergoing training. 

S. Allison (1969) developed a cost-estimating
model for undergraduate pilot training. Inputs to 
Allison’s model consist of or can be (a) under- 
graduate pilot training graduation requirements, 
(b) course requirements, (c) instructor-student 
ratios, (d) administrative and support manpower 
relationships, (e) number of aircraft and simulators 
available, (f) quantity of facilities available, and (g)
cost relationships. The model, given the inputs,
computes the cost required for training in terms of
research and development costs, investment costs,
annual operating costs, and long-range feasibility 
estimates. 

The Ozarks Regional Commission presented a 
rather detailed account of their cost-effectiveness 
system (Manuel, 1970). The goal of the commission
is closing the “income gap” between the
Ozark region and the rest of the nation. They
wanted to measure the additional value of
occupational education in the Ozark region. They 
saw their major problems as transposing the gains 
and losses into dollar terms. Benefits are calculated 
in terms of what buyers and users of the
commodity will pay, or in terms of production 
costs if the former are not available. Costs consist 
of the value of the goods and services used up in 
the project as compared with their use for other
purposes. This is called the value of alternate uses. 
If no alternate use exists, the costs are zero. 

Intangible costs and benefits cannot be put into 
dollar terms, but they can be quantified and 
compared in terms of alternate courses of action. 
If, among two projects, A gives more net benefits 
than B, but if B has intangible benefits which over- 
ride the net benefits of A, then B might be chosen 
as the course of action. 



Some of the Ozark commission’s cost- 
effectiveness formulae are presented:

Gain Scores and Final Examination
Grades 

Carver (1966, 1969, 1970) presents a rather 
conclusive argument against the use of gain or
difference scores in evaluation research. The problem
in the before-and-after measurement of gain
scores is that when small significant increases are 
registered, there may actually be a tremendously 
large increase in knowledge. This paradoxical 
result comes from the inequality of measurement 
at different points along the scale. Carver hypothesizes
that a curvilinear relationship exists between
test scores and knowledge, with knowledge 
increasing faster than test scores. One can rarely 
find a significant positive correlation between 
initial test scores and gain scores (often there is an 
inverse correlation). This is contrary to expectation,
since it is expected that the more intelligent
student will learn more and that the more interested
student will be motivated to study more.
One can partially explain this finding on the basis 
that students who already know a lot do not have 
much left to learn. Another related problem is the 
ceiling effect which occurs when the initially 
bright student already has most of the items on 
the pretest correct and does not have much room 
for improvement. Carver indicates that final 
examination grades constitute a dependent vari- 
able measure that is superior to gain scores, but 
with certain restrictions: the ratio of final knowledge
to initial knowledge must be considerably
greater than one; the correlation between initial 
knowledge and final knowledge must remain high; 
and the variance of final knowledge must be 
greater than the variance of initial knowledge. 

Carver (1969) offers another solution, one
involving separation of the initially bright from the
initially dull students. This is done to correct a 
motivation problem for the initially high scoring 
student who has to waste time completing items at 
a low level. It is possible that if the bright student 
started off at a higher level, his gain may have been 
greater. On this basis Carver concludes that final 
scores are the best, because of unacceptable 
solutions using functions of initial and final scores 
and because expectations are not confirmed about 
initially bright students. Guilford (1970), though, 
feels that absolute scaling methodology might 
offer a solution to this dilemma. 

Bereiter (1963) presents certain other related 
problems in the measurement of change: 

1. The “overcorrection-undercorrection
dilemma” which occurs when there is a
negative correlation between the initial score
and the gain score. This can be corrected so 
that a positive correlation can exist between 
initial and gain scores. 

2. The “unreliability-invalidity dilemma” 
which occurs when there is a high correlation
between pretest and posttest, thus
lowering the reliability of the difference
scores. If one obtains reliable difference
scores because of a low pretest-posttest
correlation, then less can be said about
the gain.

3. The “physicalism-subjectivism dilemma”
which involves the choice of the scale units 
given versus units conforming to psycho- 
logical meaningfulness. Bereiter recommends 
the use of terminal scores because change 
scores create too many problems. 
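
The unreliability-invalidity dilemma can be illustrated with the standard classical-test-theory expression for the reliability of a difference score. The sketch below is not Bereiter's own notation; the formula assumes uncorrelated errors of measurement, and the reliabilities and correlations used are hypothetical values chosen only to show the direction of the effect.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical-test-theory reliability of the difference score D = Y - X.

    r_xx, r_yy: reliabilities of pretest and posttest.
    r_xy: pretest-posttest correlation.
    Assumes errors of measurement are uncorrelated.
    """
    true_var = sd_x**2 * r_xx + sd_y**2 * r_yy - 2 * r_xy * sd_x * sd_y
    total_var = sd_x**2 + sd_y**2 - 2 * r_xy * sd_x * sd_y
    return true_var / total_var

# Hypothetical values: both tests have reliability .80.
high_overlap = difference_score_reliability(0.80, 0.80, 0.70)  # pre and post correlate .70
low_overlap = difference_score_reliability(0.80, 0.80, 0.20)   # pre and post correlate .20
```

Under these assumed values the difference scores are far less reliable when pretest and posttest are highly correlated, which is the first horn of the dilemma.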

Confidence Testing and Partial 
Knowledge 

Shuford, Albert, and Massengill (1966) and 
Shuford (1967) have constructed a scheme to
provide for more adequate measurement of 
student knowledge than is possible with traditional 
testing methods. They feel that additional infor- 
mation is available from the student’s degree of 
belief probabilities. A mathematical system is 
presented which ensures that a student can maxi- 
mize his expected score if he truly reflects his 
degree of belief or probability that a specific 
response choice is correct. With the traditional 
procedure, using a true-false test as an illustration, 
the student assigns a different probability for each 
response depending on his state of knowledge. If 
the student sees the probability of true as being
greater than .50, he should choose true; but if the
probability is less than .50, he should choose false; 
if it is equal to .50, he can choose either response. 
Generally, a student with poor knowledge (p = 
.51) will get the same score (if correct) as the
person with good knowledge (p = .90); therefore, 
the choice situation loses data about the student’s 
knowledge. In confidence testing, the student 
receives a confidence score (a function of probability)
if his answer is correct plus a score for the
correct answer. In addition, the student can
receive credit if he is certain that his response is 
incorrect and the response is, in fact, incorrect. In 
one study (Massengill & Shuford, 1969), using 
multiple-choice tests, confidence was divided 
among the choices to total 1.0. The subjects for 
this study were 26 college-level students. It was 
found that the confidence ratings were highly
related to the probability of their answering the
questions correctly. 
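
The property claimed for the scoring system, that a student maximizes his expected score only by reporting his true degree of belief, can be checked numerically. The sketch below uses a quadratic scoring rule as an assumed stand-in; it is one rule with this property and is not presented as the specific system of Shuford, Albert, and Massengill. The function names are hypothetical.

```python
def quadratic_score(reported, outcome_true):
    """Score for reporting probability `reported` that a statement is true."""
    miss = (1.0 - reported) if outcome_true else reported
    return 1.0 - miss ** 2

def expected_score(belief, reported):
    """Expected score when the student's actual degree of belief is `belief`."""
    return (belief * quadratic_score(reported, True)
            + (1.0 - belief) * quadratic_score(reported, False))

def best_report(belief, steps=100):
    """Grid search for the reported probability that maximizes expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_score(belief, q))

# A student who believes p = .90 does worse, on average, by claiming certainty.
honest = expected_score(0.90, 0.90)
overstated = expected_score(0.90, 1.00)
```

Unlike the traditional choice score, this rule also separates the student with p = .51 from the student with p = .90 when both are correct.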

Gardner (1970) administered a course pretest 
using confidence estimates to 151 student instructors.
The test was designed to determine necessary
training for these instructors. Even with the
confidence scoring, there was no significant correlation
of the pretest with practice teaching or with
final class standing. The author still claims that
confidence testing yields a better assessment of
student knowledge, as well as higher reliability.

Coombs, Milholland, and Womer (1956) 
present another method of assessing additional 
student knowledge. Traditionally, in scoring a 
four-choice multiple-choice question, a subject is 
given a point for the correct answer and no points 
for a choice of any incorrect answer or distractor. 
Partial knowledge exists when the student can 
identify one or more of the distractors. Using this 
technique, in a multiple-choice format, one point 
is given for each distractor identified and three 
points are subtracted if the correct answer is
identified as a distractor. Scores on each four- 
choice item can range from plus three to minus 
three. Partial-knowledge testing, then, yields 
increased item and test variance and penalizes for 
random guessing. Two possible disadvantages of 
this method are that it is not applicable to all 
kinds of tests (e.g., true-false tests), and the
scoring is time-consuming. 
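
The scoring rule described above is simple enough to state as a short routine. This is a minimal sketch of the rule as summarized here; the function name and the choice labels are hypothetical.

```python
def partial_knowledge_score(marked_as_wrong, correct_choice):
    """Score one multiple-choice item under the elimination rule.

    marked_as_wrong: set of choices the student identifies as distractors.
    correct_choice: the keyed answer.
    """
    score = 0
    for choice in marked_as_wrong:
        if choice == correct_choice:
            score -= 3   # penalty for eliminating the correct answer
        else:
            score += 1   # credit for each distractor identified
    return score

# Four-choice item keyed "A".
full_knowledge = partial_knowledge_score({"B", "C", "D"}, "A")
partial_knowledge = partial_knowledge_score({"B"}, "A")
misinformation = partial_knowledge_score({"A"}, "A")
```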

Characteristics of Material to be 
Learned 

R. Allison (1960) gave 13 different learning 
tasks to 315 enlisted men at a United States Naval
Training Center. Thirty-nine aptitude and achieve- 
ment measures were also administered. Rate, 
curvature, and speed during the first and second 
half of the task were used as criteria of learning. 
Using factor analytic techniques, Allison found 
that learning was organized in a multidimensional 
way. Therefore, he contended that learning is not 
a single trait, but contains several factors 
depending “. . . upon the psychological process involved
in the learning task and the content of the
material to be learned (p. hV).” Also, the aptitude 
and achievement measures had much in common 
with the learning measures, demonstrating that the 
ability to apply knowledge and the acquiring of 
knowledge are very similar.

Naylor, Briggs, and Reed (1968) found that a
primary task (tracking) is performed better in
conjunction with a coherent or meaningful
secondary task (monitoring) than in conjunction
with a less meaningful or coherent task. Therefore,
secondary task coherence can affect primary task 
performance in dual learning situations. 

Weitz (1962, 1964) determined that with 
different difficulties of independent variables (e.g.,
amount of information given in a training task), 
the maximal effect on transfer of training will 
occur either early or late during the trials. For easy 
information the maximal effect occurs early and
for difficult information the maximal effect occurs
late.

Underwood (1969, 1970) performed several 
learning experiments which demonstrate a breakdown
of the total-time law, which states that the
amount learned is a function of total study time.
Eleven experiments were performed, each varying
the frequency of massed and distributed practice.
The results showed that (a) recall of distributed
practice was always greater than recall of massed 
practice; (b) massed practice words which were 
presented with the same exact frequency as dis- 
tributed practice words were judged to have been 
presented less frequently; and (c) the difference
(in recall) between massed and distributed practice 
increased as the frequency of repetition increased. 
Underwood hypothesizes that the difference 
between massed and distributed practice could be 
due to a failure of reception under massed practice 
which resulted in learning as if under a less 
frequent rate of presentation. 

Jensen (1971) gave two groups of high school 
students equivalent forms of a visual and auditory 
digit span test. Both forms were administered to 
both groups in a counterbalanced order under 
immediate and 10-second delayed recall condi- 
tions. Jensen found that auditory memory was 
better than visual memory for immediate recall, 
but that the reverse was found for the 10-second 
delay condition. 

Rather than viewing instruction as merely 
presentation of information, Whitmore (1970a) 
feels that it is a way of controlling student 
behavior so that learning takes place. Some factors 
which affect verbal learning are (a) attention span,
(b) organization of the material into meaningful 
units, and (c) sequencing of material (e.g., hierarchical,
whole-part, and general-specific).

Carkhuff (1969) concluded relative to counsellor
training that “. . . those programs in which
high-level functioning trainers focus explicitly 
upon dimensions relevant to helper gains and make 
systematic employment of all significant sources 
of learning, including, in particular, modeling, are 
most effective (p. 244).”

Composition Scoring

Fostvedt (1965) constructed several criteria for 
the evaluation of high school English compositions
in order to correct for non-uniformity of evalua- 
tion standards across teachers. Several sources 
were used to formulate the criteria: (a) coherence 
and logic, (b) development of ideas, (c) diction, 
(d) organization, and (e) emphasis. A sample of 
college English experts (N = 9) ranked these criteria.
Kendall’s coefficient of concordance was .75
(p < .01), indicating agreement among the experts
as to the importance of each criterion. Next, 30 
English teachers were asked to grade 20 themes as 
“above average,” “average,” or “below average” 
on each criterion. Analysis of variance was used to 
test criterion reliability, and the result was not 
statistically significant (p > .05); therefore, different
teachers graded the same themes differently.
Chi-square tests also demonstrated no agreement; 
hence, the criteria were not reliable when used for 
grading purposes. 
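
For readers unfamiliar with Kendall's coefficient of concordance, the computation for untied rankings can be sketched as follows. The rankings shown are hypothetical and are not Fostvedt's data; the formula is the usual one without a correction for ties.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings: one list per judge, each giving ranks 1..n for the same n items.
    Returns a value from 0 (no agreement) to 1 (perfect agreement).
    """
    m = len(rankings)          # number of judges
    n = len(rankings[0])       # number of items ranked
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: three judges ranking five criteria identically.
perfect = kendalls_w([[1, 2, 3, 4, 5]] * 3)
# Two judges in complete disagreement over three criteria.
opposed = kendalls_w([[1, 2, 3], [3, 2, 1]])
```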

Bushan and Ginther (1968) feel that there is a 
good deal of personal bias in grading essays and 
that a more objective method is needed. Differ- 
entiating between essays should take into account 
“. . . the structure and length of the sentence, 
vocabulary, and length as well as sociological and 
psychological construct of the test (p. 417).” A 
computer program was used which read off and 
quantified several relevant, scorable variables on 
11 University of Chicago essays which were also
graded by three experts. The three best and three 
worst essays were then coded for the computer 
and so analyzed. Thirteen criteria were employed 
to determine differences. After the differences 
were ascertained, these were used on the remaining
five essays. Overall results demonstrated that
better essay writers (a) have a larger vocabulary;
(b) include statements of other authorities who are
named; (c) give exact dates for events; (d) use
numbers for quantities; and (e) use fewer words
from psychological categories that can be analyzed 
for personality differences. 

Testing 

Much of the previous discussion in this chapter 
has been concerned with various applications of 
testing. In this section, testing in the pure sense is 
discussed. 

Paper-and-pencil tests, as the name implies, are
tests which the examinee takes with a printed test 
and a pencil. Most tests of this type require at least 
some reading ability. Some types of paper-and- 
pencil tests, though, require no reading ability at 
all. Many perceptual speed and perceptual motor
tests are available on the market. Users of perceptual
tests feel that they are related to some
performance aspects of jobs. The verbal type of
paper-and-pencil test should be used only in jobs
which are primarily verbal or cognitive in content.
It would probably be inappropriate to give a
paper-and-pencil intelligence test or a vocabulary
test to a person applying for a mechanical trade. 
Such tests, however, would be appropriate for 
some clerical positions. In performance tests 
(Danzig & Keenan, 1956; Fiske, 1954), the trainee 
or employee is asked to perform some tasks in 
which the content is relevant to his present or 
future job. Some performance tests are less
obviously related to jobs than others. Performance
tests can range from dominoes, mazes, and puzzles
to performance of job tasks using real job equipment.
Perhaps the most sophisticated type of
performance test is the proving ground. In the
proving ground (McSheehy, 1959), the trainee is
placed on the job. An attempt is made to cycle
him through all the job tasks in a short period of
time. As he performs each task, the trainee is
evaluated and he, in turn, evaluates the training in 
relation to the job. 

Statistical Methods 

There are a number of little used and less 
understood quantitative methods which can be 
useful for training evaluation and student achieve- 
ment measurement. 

Partial Correlation and Part 
Correlation 

Partial correlation, according to DuBois (1957),
is “. . . the Pearson product-moment correlation
between two sets of residuals, from both of which
variance associated with the same set of independent
variates has been eliminated (p. 192).” In
actual practice partial correlation is used to hold 
one or more extraneous or contaminating variables 
constant. For example, in calculating the corre- 
lation between height and weight, one might wish 
to hold age and sex constant. Part correlation, on 
the other hand, is “defined as the Pearson 
product-moment correlation between a set of 
residuals on one hand and an unmodified variable 
on the other. . . ” “In studies of learning, for
example, it may be pertinent to inquire into the
degree to which final standing in some skill, less
the variance related with initial standing, is related 
to some outside predictor variable (p. 60).” The 
use of this statistic (part correlation) will help to 
clarify some of the problems associated with the 
use of raw gain scores mentioned earlier in this 
chapter. 
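
The distinction between the two statistics can be made concrete with the learning example quoted above. The sketch below computes each coefficient both from the zero-order correlations and directly from regression residuals, as in the quoted definitions. The scores are hypothetical, and all function and variable names are illustrative assumptions rather than DuBois's notation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(y, z):
    """Residuals of y after simple linear regression on z."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]

def partial_r(r_xy, r_xz, r_yz):
    """x with y, z removed from both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def part_r(r_xy, r_xz, r_yz):
    """Unmodified x with y, z removed from y only."""
    return (r_xy - r_xz * r_yz) / sqrt(1 - r_yz ** 2)

# Hypothetical scores for six trainees.
initial = [10, 12, 15, 18, 20, 25]     # initial standing (z)
final = [30, 31, 38, 40, 47, 50]       # final standing (y)
predictor = [3, 5, 4, 7, 8, 6]         # outside predictor (x)

r_xy = pearson(predictor, final)
r_xz = pearson(predictor, initial)
r_yz = pearson(final, initial)
part = part_r(r_xy, r_xz, r_yz)
partial = partial_r(r_xy, r_xz, r_yz)
```

Here the part correlation relates the predictor to final standing with initial standing removed from final standing only, which is the quantity the quoted learning example asks about.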

Factor Analysis

Factor analysis is simply a statistical method 
for eliminating the redundancy present in correla- 
tion matrices. One might, for example, be able to 
reduce a 20 by 20 correlation matrix to a 20 by 5 
factor matrix, thus using only five factors rather
than 20 items to describe the matrix. 
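
The idea of removing redundancy can be shown on a very small scale. The sketch below extracts only the first principal component of a hypothetical 3 by 3 correlation matrix by power iteration. Principal-component extraction is assumed here for simplicity; it is only one of several factoring methods, and a full analysis would extract further factors and usually rotate them.

```python
def first_component(corr, iterations=200):
    """Largest eigenvalue and unit eigenvector of a correlation matrix,
    found by power iteration."""
    n = len(corr)
    vec = [1.0 / n ** 0.5] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(p * p for p in product) ** 0.5
        vec = [p / eigenvalue for p in product]
    return eigenvalue, vec

# Hypothetical matrix: three items that intercorrelate .50.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
eigenvalue, vec = first_component(corr)
loadings = [v * eigenvalue ** 0.5 for v in vec]   # item loadings on the component
share = eigenvalue / len(corr)                    # proportion of total variance
```

In this assumed case a single component carries two-thirds of the variance of the three items.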

Obviously, factor analysis can be a useful tool 
in training evaluation and student achievement 
measurement. For example, one might have a 15- 
item rating scale which measures on-the-job
behavior of training school graduates. It would be 
inappropriate to describe the on-the-job behavior 
of these men in terms of either 15 separate 
dimensions or one overall composite when the 
15-item rating scale might be reduced to three or
four dimensions which more parsimoniously
describe on-the-job behavior. If predictor tests 
were used, then, significant validity coefficients 
might be dependent upon whether or not one used 
factor analysis. Bergman (1970) had such an 
experience when attempting to predict the 
behavior of 139 oil company salesmen. 

Another old technique, but one which will 
probably be used more frequently during the next 
decade, is Q-factor analysis. In performing a 
Q-factor analysis, one simply factor analyzes the 
matrix of person correlations rather than item 
correlations. This method can be useful for 
grouping persons who think or behave similarly.
For example, when constructing a training pro- 
gram, it may be useful to know the different 
cognitive styles of the potential trainees so that 
the training can be adapted to the needs of each 
homogeneous group. Eddy, Glad, and Wilkins 
(1967) used Q-factor analysis and found that their 
training program differentially affected
“ . . . students depending upon their own goals,
attitudes, and characteristics and of their work
environments (p. 23).”
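
The first step of a Q-factor analysis, correlating persons rather than items, can be sketched as below. The trainee labels, scores, and function names are hypothetical; the resulting person-by-person correlations are what would then be factor analyzed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def person_correlations(scores):
    """scores maps each person to a list of item scores.
    Returns the correlation for every pair of persons."""
    people = sorted(scores)
    return {
        (p, q): pearson(scores[p], scores[q])
        for i, p in enumerate(people)
        for q in people[i + 1:]
    }

# Hypothetical item profiles for three trainees.
scores = {
    "A": [5, 4, 2, 1],
    "B": [4, 3, 1, 0],
    "C": [1, 2, 4, 5],
}
q_matrix = person_correlations(scores)
```

Trainees A and B share the same profile shape and would fall into one group, while C shows the opposite pattern.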

Tucker (1966) recently presented a rather 
unique application of factor analysis to the 
measurement of student learning. His innovation,
though, has undeservedly been ignored by all but a
few members of the behavioral science
community. Using the Eckart-Young theorem (a
fundamental matrix decomposition theorem of
factor analysis), Tucker found that individuals
learn in qualitatively different ways over trials 
such that individuals can be grouped or clustered 
according to the way they perform or learn. 
Tucker would not use a single, homogeneous learn- 
ing curve to describe what is, in fact, a heterogene- 
ous phenomenon. 



Canonical Correlation 

Canonical correlation is an extension of factor 
analysis to the situation in which two separate sets 
of variables exist. The first canonical correlation is
the highest correlation obtainable between a linear
composite of the first set of variables and a linear
composite of the second set of variables. The
second canonical correlation is the highest
correlation between a second pair of composites,
one from each set, that are uncorrelated with the
first pair. Canonical correlations are continually
extracted until all the
common variance between both sets of variables is 
accounted for. The method is most applicable 
when there are two separate sets of variables: for 
example, one set of predictor variables and one set 
of criterion variables. 

Moderator Variables 

A test is a moderator when its score differen- 
tially determines the predictability of another test 
or variable. For example, one may be able to 
adequately predict the performance of college 
students using an intelligence test for those who 
score high on a test of achievement motivation, 
but not for those who score low on the test of 
achievement motivation. Race is one of the more 
currently popular moderator variables. Much 
recent research has shown that employment tests 
are differentially predictive across racial groups, 
thus supporting the contention that common 
selection standards for Negroes and whites are 
inappropriate or unfair. Moderator variables are 
sufficiently important to student achievement 
measurement and training evaluation that they are
given separate treatment in another chapter of this 
review. 
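
A moderator effect of the kind described above can be checked by computing the predictor-criterion correlation separately within moderator subgroups. The following is a minimal sketch; the records, the cutoff, and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def validity_by_moderator(records, cutoff):
    """records holds (predictor, criterion, moderator) tuples.
    Returns the validity coefficient for the high and low moderator groups."""
    high = [(p, c) for p, c, m in records if m >= cutoff]
    low = [(p, c) for p, c, m in records if m < cutoff]
    r_high = pearson([p for p, _ in high], [c for _, c in high])
    r_low = pearson([p for p, _ in low], [c for _, c in low])
    return r_high, r_low

# Hypothetical data: (intelligence score, grade, achievement motivation).
records = [
    (1, 2, 9), (2, 4, 8), (3, 6, 9), (4, 8, 7),  # high motivation
    (1, 5, 2), (2, 3, 3), (3, 3, 1), (4, 5, 2),  # low motivation
]
r_high, r_low = validity_by_moderator(records, cutoff=5)
```

Here the predictor is valid only in the high-motivation group, which is the pattern that identifies achievement motivation as a moderator.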

Convergent and Discriminant Validity 

Campbell and Fiske (1959) would define con- 
vergent validity as a high correlation between tests 
purporting to measure the same thing, while dis- 
criminant validity would refer to independence of 
tests measuring different factors. The one criterion 
for convergent validity is that the correlations 
between several tests measuring one trait must be 
significantly greater than zero (mono-trait hetero- 
method correlation). For discriminant validity, 
three criteria must be met: (a) the
single-trait-multimethod correlations must be significantly
greater than the correlations not having trait or
method in common; (b) the
single-trait-multimethod correlation should be significantly
higher than the correlations between different
traits measured by the same
method; and (c) there should be a stable pattern of
trait interrelationship regardless of the method 
used. 
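
The comparisons named in these criteria can be sketched by sorting correlations according to whether the two measures share a trait, a method, or neither. The sketch below is descriptive only: it compares magnitudes and does not perform the significance tests the criteria call for. The traits, methods, correlation values, and function names are hypothetical.

```python
def classify_mtmm(correlations):
    """Group correlations by whether the two measures share trait or method.
    Keys are ((trait, method), (trait, method)) pairs."""
    groups = {
        "monotrait_heteromethod": [],    # same trait, different methods
        "heterotrait_monomethod": [],    # different traits, same method
        "heterotrait_heteromethod": [],  # nothing in common
    }
    for ((trait_a, method_a), (trait_b, method_b)), r in correlations.items():
        same_trait = trait_a == trait_b
        same_method = method_a == method_b
        if same_trait and not same_method:
            groups["monotrait_heteromethod"].append(r)
        elif not same_trait and same_method:
            groups["heterotrait_monomethod"].append(r)
        elif not same_trait and not same_method:
            groups["heterotrait_heteromethod"].append(r)
    return groups

def passes_descriptive_check(groups):
    """True when every same-trait correlation exceeds every correlation
    between different traits (criteria a and b, magnitudes only)."""
    lowest_validity = min(groups["monotrait_heteromethod"])
    return (lowest_validity > max(groups["heterotrait_heteromethod"])
            and lowest_validity > max(groups["heterotrait_monomethod"]))

# Hypothetical correlations: two traits rated by peers and by supervisors.
correlations = {
    (("leadership", "peer"), ("leadership", "supervisor")): 0.60,
    (("technical", "peer"), ("technical", "supervisor")): 0.55,
    (("leadership", "peer"), ("technical", "peer")): 0.30,
    (("leadership", "supervisor"), ("technical", "supervisor")): 0.35,
    (("leadership", "peer"), ("technical", "supervisor")): 0.10,
    (("technical", "peer"), ("leadership", "supervisor")): 0.15,
}
groups = classify_mtmm(correlations)
supported = passes_descriptive_check(groups)
```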

Campbell and Fiske advocate the use of a 
multitrait-multimethod matrix which is in reality 
confusing and unnecessary, since all that is 
required is understanding of the concepts involved. 
Dielman and Wilson (1970) and Kavanagh, 
MacKinney, and Wolins (1971) are among those 
who have successfully applied this technique. 

Internal and External Validity 

Campbell and Stanley (1963) define internal 
validity as “significance,” and external validity as 
measured change in job behavior. Campbell, 
Dunnette, Lawler, and Weick (1970) indicate that
internal criteria are those that are directly tied to 
training behavior and that external criteria meas- 
ure subsequent change in job behavior. 

Campbell and Stanley (1963) and Winch and 
Campbell (1969) provide an exhaustive list of 
“threats” to internal and external validity. The 
threats to internal and external validity are (a) 
history or antecedents, (b) maturation of subjects,
(c) testing effects, (d) instrumentation, (e)
statistical regression (extreme scores), (f) differential
selection of comparison groups, (g) experimental
mortality, (h) selection-maturation interaction, (i)
pretest sensitization, (j) interaction between
selection bias and the experimental variable, (k)
instability and unreliability of measures, (l)
conditions making the experimental setting atypical or
artificial, (m) multi-treatment interference, (n)
irrelevant components of complex measures, (o)
failure to replicate entire relevant parts of the
experiment, (p) effects of experimental
arrangements, and (q) effects of prior treatments. These
writers recommend the use of experimental 
designs and statistical treatments which minimize 
the effects of these variables. 

To assess effects of training, Campbell,
Dunnette, Lawler, and Weick (1970) recommend a
design that includes a placebo group and a
posttest-only group.

In this design, the placebo group is necessary
because the measurable effects of training can be
attributed to the “Hawthorne effect.” The posttest
group (IV) is needed to avoid the possible
effects of pretest sensitization.

Scaling Techniques 

Siegel and Schultz (1960), Siegel, Schultz, and 
Benson (1960), and Schultz and Siegel (1961a, 
1961b) report the use of scaled behavioral check- 
lists to evaluate job performance in several Naval 
job specialties. These lists, developed on the basis
of Thurstone and Guttman scaling principles,
allow one to evaluate a man’s proficiency by
checking just one task on a list. If he can perform
that task, it can be assumed that he can perform
all tasks below that level on the scale.
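
The logic of such a scaled checklist can be sketched as follows, assuming a perfectly cumulative (Guttman) ordering of tasks. The task names and the function name are hypothetical and are not taken from the checklists cited.

```python
def inferred_profile(ordered_tasks, checked_task):
    """Infer a performance profile from the single task checked off.

    ordered_tasks runs from easiest to hardest. Under a perfect Guttman
    scale, performing one task implies performing every easier task.
    A value of False means only that the task was not demonstrated.
    """
    level = ordered_tasks.index(checked_task)
    return {task: position <= level
            for position, task in enumerate(ordered_tasks)}

# Hypothetical checklist, easiest task first.
tasks = ["read meter", "replace fuse", "align receiver", "troubleshoot radar"]
profile = inferred_profile(tasks, "align receiver")
```

Checking "align receiver" credits the man with that task and the two easier ones, which is why a single check can stand for the whole list.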

Stone and Sinnett (1968) sought to determine 
whether or not the four-point grade point average 
distribution can be represented as being an equal 
interval scale. Thirty-six members of the Univer- 
sity of North Dakota were used as judges. The 
grade range of A to F was divided into 12 
intervals, e.g., F to F+, F+ to D-, D- to D, D to
D+, . . . , A- to A. The judges were then asked
to choose the grade intervals they thought were
larger. They used the paired-comparison technique
to rank all intervals. The median coefficient of
consistency for all judges was .83. A scale was then
constructed using Thurstone techniques. The 
results of this scaling analysis were that (a) the 
judged scale was found to be a logarithmic scale 
which could be compared to the grade point
average scale; (b) generally, the intervals were 
judged to be smaller as the grade levels decreased; 
(c) the midpoint of the scale was between and 
B-; (d) the distance between the midpoint of the
grade to the (+) point appeared larger than from
the (-) point to the midpoint; and (e) intervals
containing a grade boundary were judged larger
than those within a grade (e.g., C+ to B- was
thought greater than C to C+).
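
A coefficient of consistency for one judge's paired comparisons can be sketched as below. The report does not state which coefficient was computed; the sketch assumes Kendall's coefficient, which is based on the number of circular triads (A preferred to B, B to C, and C to A). The function name and the example tallies are hypothetical.

```python
def coefficient_of_consistency(wins):
    """Kendall's coefficient of consistency for a complete set of paired
    comparisons. wins[i] is the number of comparisons in which object i
    was chosen. Returns 1.0 for perfectly transitive judgments."""
    n = len(wins)
    circular_triads = (n * (n - 1) * (2 * n - 1) / 12
                       - sum(a * a for a in wins) / 2)
    if n % 2:  # odd number of objects
        max_triads = (n ** 3 - n) / 24
    else:      # even number of objects
        max_triads = (n ** 3 - 4 * n) / 24
    return 1 - circular_triads / max_triads

# Hypothetical tallies for four objects.
transitive = coefficient_of_consistency([3, 2, 1, 0])
least_consistent = coefficient_of_consistency([2, 2, 1, 1])
```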

Schultz and Siegel (1962a, 1962b) used
multidimensional scaling analysis which integrates
psychophysical judgments and factor analysis. The 
procedure is “ . . . obtaining a matrix of inter- 
stimulus distances (psychophysical judgments) 
and . . . determining the dimensionality of the 
space containing the stimulus points (p. 3).” This 
method recognizes the multidimensionality, as 
opposed to the unidimensionality, of job
performance criteria. Eighteen tasks performed by the
avionics electronics technician were delineated. 
Judges were then required to indicate, along a 
scale, the distance or similarity between all 
possible pairs of tasks. After the analysis was 
completed, four job dimensions were found: (a) 
electro-comprehension, (b) equipment operation 
and inspection, (c) electro-repair, and (d)
electro-safety. Schultz and Siegel (1964) then used these
four dimensions to construct unidimensional scales
via Thurstone and Guttman techniques. Siegel and
Schultz (1963) and Schultz and Siegel (1963) also
applied multidimensional scaling analysis to
classification of circuit types and to the Naval aviation
electronics technician supervisor rating.

Signal Detection 

Siegel and Pfeiffer (1969) and Siegel, Fischl, 
and Pfeiffer (1968) successfully applied
signal detection theory to the prediction of
academic success in both a military and a college
setting. Signal detection theory “ . . . provides a
way of controlling and measuring the criterion the
observer uses in making decisions about signal
existence and provides a measure of the observer
detection sensitivity (d') that is independent of his
decision criterion (p. 145).” Eighteen subjects in
Naval electronics training were divided into 
journeyman, intermediate, and advanced levels of 
training. Also, 40 male college sophomores were 
divided into high grade point average (2.88) and 
low grade point average (1.67) groups. The college
sample was given a 49-item (psychology) true-false
test, and the military sample was given a 23-item
(circuitry) test. Items that are answered true are
considered signal while items answered false are
considered noise. A sensitive observer is one who
differentiates with few errors between signal and
noise. The results of this study were that (a) d'
was 2.16 for the high grade point average students,
and 1.58 for the low grade point average students;
(b) Naval technicians with the least training and
experience had a d' of .64, while those with the
most training and experience had a d' of 3.20; (c)
analysis of variance results were significant for
both groups at p < .01; (d) Scholastic Aptitude
Test (SAT) scores were related to the college
sample grade point averages; (e) other academic
predictors did not correlate significantly with d',
suggesting that it measures a different basic
process; (f) SAT scores accounted for 16 percent
of the high grade point average variance and 13
percent of the low grade point average variance;
but with the addition of d' the predictable
variance increased to 33 percent and 5^ percent,
respectively; and (g) the variance accounted for by
the military tests was 11 percent, but it increased
to 50 percent with the addition of d'. The authors
conclude that d' can be used both as a predictor of
performance and as a measure of training success.
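
The sensitivity index d' can be computed from a true-false test as sketched below. The sketch assumes the standard equal-variance normal model, in which d' is the difference between the normal deviates of the hit rate and the false alarm rate. It also assumes that a "hit" is answering true to a true-keyed item and a "false alarm" is answering true to a false-keyed item. The counts and function names are hypothetical.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: z(hit rate) minus z(false alarm rate).
    Rates of exactly 0 or 1 have no finite normal deviate and raise an error."""
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def decision_criterion(hits, misses, false_alarms, correct_rejections):
    """Response bias, computed separately from sensitivity."""
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))

# Hypothetical counts for one examinee on 100 true and 100 false items.
sensitivity = d_prime(84, 16, 16, 84)
bias = decision_criterion(84, 16, 16, 84)
```

Two examinees with the same d' may still differ in their willingness to answer "true," which is the sense in which sensitivity is independent of the decision criterion.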

The theory of signal detection bears an obvious 
relationship to the previously mentioned concept
of confidence testing. Test scores based on
confidence testing should correlate higher with
signal detection variables (d') than with traditional
test scores. Indeed, several investigators (Clarke, 
1964; Pollack & Decker, 1964) have used confi- 
dence estimates in their signal detection studies. 
Signal detection, multidimensional scaling, and 
confidence testing all derive from experiments 
based upon psychophysical principles which are 
discussed in the next section. 

Psychophysics 

Siegel and Federman (1970) combined the 
magnitude estimation technique with peer group 
ratings to arrive at a novel method of performance
evaluation. The subjects for this experiment (N =
20) were two groups of 10 avionics technicians. 
Each man was asked to estimate the number of 
uncommonly ineffective and uncommonly 
effective performances across nine performance 
dimensions for the nine other men over a specified 
period of time. The ratio of the number of
uncommonly effective (UE) performances divided by the
number of uncommonly effective performances
plus the number of uncommonly ineffective (UI)
performances, ΣUE/(ΣUE + ΣUI), yields an index
which varies between zero and one. One of the
two groups was more experienced than the other, 
and this technique was able to differentiate 
between them. 
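
The index can be computed directly from the peer estimates, as in the following minimal sketch. The estimates and the function name are hypothetical.

```python
def effectiveness_index(ue_estimates, ui_estimates):
    """Ratio of summed uncommonly effective (UE) estimates to the sum of
    UE and uncommonly ineffective (UI) estimates; ranges from 0 to 1."""
    total_ue = sum(ue_estimates)
    total_ui = sum(ui_estimates)
    if total_ue + total_ui == 0:
        raise ValueError("no uncommon performances were estimated")
    return total_ue / (total_ue + total_ui)

# Hypothetical estimates for one technician from three peers.
index = effectiveness_index([6, 4, 5], [2, 1, 2])
```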

In addition to the aforementioned study, Siegel 
and his associates at Applied Psychological
Services have over the years applied the classical
psychophysical methods to several other aspects of 
military and performance evaluation. Terminal 
threshold concepts were applied to electronics 
troubleshooting performance evaluation (Siegel, 
1968). Psychophysical methods were used to 
maximize the probability of operator malfunction 
recognition (Miehle & Siegel, 1967). Activity 
circuit interactions were related to perceived 
circuit complexity (Pfeiffer & Siegel, 1967b). 
Magnitude estimation and the structure of intellect 
model were used to relate electronics maintenance 
job activities and the intellective scale values of 
these activities (Pfeiffer & Siegel, 1967a). The 
psychological relationship between perceived 
circuit complexity and a physical measure of 
circuit complexity was ascertained (Pfeiffer & 
Siegel, 1966). Magnitude estimates of perceived 
circuit complexity were related to subjective and 
objective job correlates (Siegel & Pfeiffer, 1966b). 
Magnitude estimation was used to measure 
avionics maintenance personnel subsystem
reliability (Siegel & Pfeiffer, 1966a). And, finally,
magnitude and category psychophysical scaling
methods were used by journeyman electronics 
personnel to scale the complexity of various 
aspects of their own jobs (Pfeiffer & Siegel, 1965). 

Summary 

The first section of this chapter presented an 
overview of some of the kinds and characteristics 
of dependent measures used in training evaluation 
and student achievement measurement. The test 
construction portion of this chapter contained a 
brief discussion of the steps to be followed in 
constructing a test plus some studies using novel 
tests or testing techniques. Other topics reviewed 
in this chapter were (a) hierarchical and sequential 
testing, (b) criterion- and norm-referenced testing, 
(c) performance evaluation problems, (d) cost 
effectiveness, (e) gain scores and final examination 
grades, (f) confidence testing and partial
knowledge, (g) characteristics of the material to be
learned, (h) composition scoring, and (i) statistical
methods. 



IV. LEARNING STYLES AND MODERATOR 
VARIABLES 

Scope of the Problem 

The sensitivity and predictive power of student 
measurement and training evaluation techniques 
can often be increased through the use of modera- 
tor variables. This is because certain attributes of 
select groups tend to make the testing evaluation 
methods more or less appropriate for the groups. 
Some of the factors which can be used as
moderators are (a) achievement level, (b) personal and
environmental variables, (c) social background 
factors, (d) cognitive style, and (e) affective 
reactions. 

Cognitive styles are modes of thought, percep- 
tion, and memory; they are also information 
processing habits. Some of the various types of 
cognitive styles that have been identified are (a) 
field dependence-independence, (b) attention span 
(or span of awareness), (c) breadth of categorizing 
(e.g., lumpers and splitters), (d) conceptual styles
(e.g., modes of categorization), (e) complexity
versus simplicity in word perception, (f)
reflective-impulsive, (g) leveling versus sharpening,
(h) susceptibility to cognitive interference, and (i)
ability to accept unrealistic experiences. French
(1963), using a factor analytic approach,
delineated two types of problem solvers: (a) those using
a systematizing approach and (b) those using a 
scanning approach. 



Rundquist (1969) contends that item analysis, 
factor analysis, and moderator variables have not 
helped to increase predictive efficiency because 
these various methods fail to take into account the 
fact that different antecedents can produce the 
same behavior across individuals (e.g., visual recall
via eidetic imagery or by short term memory). 
According to Rundquist, one must learn the 
mediating processes used by individuals in learning 
to do a job and then construct tests for the ante- 
cedent behaviors. These new tests would be better 
measures of an ability than more global tests, and 
they could avoid confounding effects. The new 
test or measure may be slanted more toward one 
antecedent than another, thus increasing the 
validity coefficient. 

The overall trend towards individualization has 
caused some writers (Whitla, 1969) to plead for 
more research on student types, class mix, and the 
disadvantaged. Others (Bligh, 1965) have called for 
increased differentiation of norms for different 
groups (e.g., sex, race, locale). Finally, some others
(Project Impact, 1970) claim that computer
assisted instruction and other forms of
individualized instruction are the best way to account for
broad student differences. 

On the debit side, Gagne (1968) disputes the 
existence of learning styles. He thinks computer 
assisted instruction puts too much stress on the 
machine rather than on the student. He does, 
though, emphasize the need for individualized 
instruction, and he acknowledges the idiosyncratic 
nature of the student. Cohen (1970) feels that one 
must be careful when using cognitive styles as 
moderators and instructional aids, since they can 
change over time. For example, much of Piaget’s 
work has shown that the child’s problem solving 
style and conceptual mode of thinking will 
qualitatively change from infancy to adulthood. 
Cohen concludes that a valid decision about an 
individual’s cognitive style at one time may prove 
to be invalid at another time. 

One final note concerns the special case of the 
moderator variable approach when aptitudes or 
aptitude test scores interact. When this occurs,
differential treatment of groups is mandatory. If 
not, erroneous or contaminated results will occur. 

Motivation and Types of
Intelligence

There has been a plethora of recent research 
emphasizing the effects of differential motivation 
and differential thinking styles (erroneously 
termed “intelligence”) on student achievement. 
These concepts certainly should be held in mind
by anyone concerned with student achievement, 
from either the measurement or the instructional 
point of view. However, the payoff of the studies 
in these areas seems, as yet, indeterminate and 
problematical. Many of the studies are contradic- 
tory in results, and others require cross validation 
before their indications can be fully exploited. 

Jensen (1969) postulates that there are two 
types of intelligence, abstract and associative, and 
that instruction and testing should be differen- 
tially tailored to suit these different modes of 
learning. 

Rimland (1969) also suggests that there are two 
types of intelligence, practical and abstract. 
Rimland hypothesizes that practical intelligence is 
needed for job performance, and that abstract 
intelligence is needed for academic work. Such 
thinking would imply that most trade schools 
should rely heavily on job performance testing to 
measure student achievement. Rimland says that 
the traditional g, or general intelligence factor, 
measures “intracerebral events,” or the ability to
abstractly manipulate symbols and events in the 
head. This is the ability required of test takers. 
Others are better at “extracerebral events,” or the 
ability to sustain attention on and perform simple 
tasks which simulate the job (e.g., perceptual
speed). Rimland posits that these two types of
intelligence are mutually exclusive. In his research,
he found that intelligence test scores correlated
much higher with school grades than did
performance test scores, but that performance test scores
correlated much higher with job performance than 
did intelligence test scores. He concludes that 
different types of training and separate types of 
measurement are needed for students with differ- 
ent types of intelligence. 

Rotter (1966) conceives the effect of reinforce- 
ment on behavior as dependent on whether the 
person perceives a causal relationship between his 
own behavior and the reward. If not, the result is 
attributed to luck or to the control of others. 
Internal control exists when the student thinks 
reinforcement is contingent upon his own 
behavior, while external control is when the 
student thinks reinforcement is controlled by 
others or by chance events. 

In one study investigating the internal-external 
control thesis (Scott & Phelan, 1969), three groups 
of hard-core unemployables were tested with
Rotter’s Internal-External Control Scale. The
subjects in all three groups were matched on age,
socioeconomic status, and scholastic aptitudes.
The results demonstrated that black and Mexican
American subjects showed greater external
control than did white subjects. The authors
concluded that the externally controlled subjects
did not feel that there was a relationship between
individual effort and reward; therefore, they did
not work unless given external reinforcement (e.g.,
praise, money).

Atkinson (1966) presents a somewhat more 
rigorous theory of motivation involving
achievement motivation, incentive, and goal expectancy.
Atkinson’s theory is depicted by the formula: 

Motivation = f(motive x expectancy x incentive) 

With motivation to approach a goal (nAch) held 
constant at 1.00 and with expectancy and incen- 
tive equal to .5, then the probability of goal 
approach is .25 (the highest possible). Atkinson 
defines incentive as the goal attractiveness, and 
motive as the ability to strive for satisfaction or to 
accomplish. “The strength of motivation to
approach decreases as probability of success
increases from .50 to near certainty (Ps = .90), and
it also decreases as Ps decreases from .50 to
certainty of failure (Ps = .10) (p. 17).”
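
The arithmetic behind the figure of .25 can be sketched as follows. The sketch takes the function f to be a simple product and assumes that incentive equals one minus the probability of success; that assumption is not stated in the passage above but is consistent with its example, in which expectancy and incentive are both .5. The function name is hypothetical.

```python
def approach_motivation(motive, p_success):
    """Motivation = motive x expectancy x incentive, with incentive
    assumed to equal 1 - p_success."""
    incentive = 1.0 - p_success
    return motive * p_success * incentive

# With motive held at 1.00, tabulate motivation across probabilities of success.
table = {p / 10: approach_motivation(1.0, p / 10) for p in range(1, 10)}
peak = max(table, key=table.get)
```

Under this assumption motivation is greatest at a probability of success of .5 and falls off symmetrically toward both near certainty of success and near certainty of failure.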

From this formulation, it is easily seen that the 
young, deprived black child will rarely encounter a 
probability of success of .5 or greater. Because he 
perceives a certainty of failure, he then lacks the 
motivation to approach a goal; therefore, he does 
not perform as well in student measurement situa- 
tions as the non-deprived white child who 
perceives a higher probability of success. 

Katz (1967) more or less integrates the two 
earlier theories into a coherent two-stage theory of 
development which possesses implications for 
student measurement. During the first stage (up to 
two years of age) of development, the child’s 
verbal efforts are normally reinforced by parental 
approval. Selective approval, on the part of the
parents, can develop strong habits of striving for 
proficiency in the child. During the second stage, 
the parental standards and values of achievement 
are internalized by the child. “The child’s own 
implicit verbal responses acquire through repeated 
association with the overt responses of the parents, 
the same power to guide and reinforce the child’s 
own achievement behaviors . . . . Internalization
doesn’t take place until strong externally
reinforced achieving habits have developed (p. 5).”
Lower class children (including most blacks) are
more dependent upon others for social
reinforcement in academic situations. Lacking
internalization, they will avoid achievement situations and
concentrate on other situations regarded as more
promising. “Lower class Negro children tend to be
externally oriented in situations that demand
performance. That is, they are likely to be highly
dependent on the immediate environment for the
setting of standards and the dispensing of rewards
(p. 8).”

Hess and Shipman (1965) present a very 
interesting and alternative developmental 
formulation. They feel that cognitive growth is 
“ . . . fostered in family control systems which
offer and permit a wide range of alternatives of 
action and thought and that such growth is con- 
stricted by systems of control which offer pre- 
determined solutions and few alternatives for 
consideration and choice (p. 870).” In the 
deprived family context, the parent-child control 
system “ . . . restricts the number and kind of 
alternatives for action and thought that are opened 
to the child; such constriction precludes a 
tendency for the child to reflect, to consider and 
choose among alternatives for speech and action. 
It develops modes for dealing with stimuli and 
with problems which are impulsive rather than
reflective, which deal with the immediate rather 
than the future, and which are disconnected rather 
than sequential (pp. 870-871).” Hess and Shipman 
performed a research study using deprived (black) 
and non-deprived mother and child pairs which 
supported their hypotheses. These authors 
concluded that the family shapes the modes of 
communication in the child, which in turn shape
his thought and problem solving style. 

In summation, these four positions suggest that, 
in both curriculum development and student 
measurement, differences in cognitive style and 
motivation must be accounted for in any program 
which purports to be at all comprehensive. 

Race and Aptitude as Moderator 
Variables 

In a recent survey of 13 studies, Boehm (1971) 
found that job knowledge and performance test 
criteria always yielded the highest validities. 
Generally, there are fewer validity differences 
between racial groups when these more objective 
criteria are used instead of ratings or rankings. 

McFann (1969a, 1969b) noted that the
differences between high- and low-aptitude men in Basic
Combat Training were greatest on cognitive tasks, 
and that the differences were not as marked on 
motor skills and proficiency tests. In a Project
SPECTRUM study, high-, middle-, and
low-aptitude groups were selected, and individualized
training was given using videotape, one-to-one
student-teacher ratio, feedback, reinforcement,
and small increments. In some tasks, low-aptitude
men reached standard but took two to four times
longer; in other cases they did not master the
material at all. McFann also found that
high-aptitude groups learned equally well with lecture
or individualized training, while low-aptitude 
groups learned well with individualized training, 
but not with lecture. 

Foley (1971) wanted to determine if the 
Officer Qualification Test (OQT) was biased 
against blacks in determining final Officer Candi- 
date School (OCS) grade point averages. The final 
OCS grades of blacks from Caucasian colleges were 
not significantly different from a matched white 
sample. Blacks from Negro colleges, though, did 
receive significantly different grades than their 
matched white subjects {p < .005). In general, the 
OQT predicted better for the white sample, even 
though it was significant for both races. 

Guinn, Tupes, and Alley (1970a, 1970b) 
wished to determine if the prediction of training 
success varied across subgroups. If this is the case, 
then overall predictive efficiency suffers. These 
writers found differences in training performance 
across race, area of the country, and education. All
three differences, though, were not found in all
occupational specialties. It can be inferred from
these results that factors such as race and
variations in cultural opportunity, as may exist across
different educational and regional groups, can
account for the differences in test scores across 
groups. 

In a study performed at the American Tele- 
phone and Telegraph Company (Grant & Bray, 
1970), task proficiency after training was used as a 
criterion because the investigators thought that it 
was uninfluenced by supervisory bias, peer pres- 
sure to control output, and motivation. 

Five hundred subjects, both blacks and whites, 
who met and failed to meet normal selection 
standards were involved. Seven hierarchical levels 
of training were employed using tasks regularly 
performed by craftsmen. Pretest and posttest tasks 
were given at each level, and the highest level com- 
pleted was the criterion. The results demonstrated 
that all selection instruments correlated with 
highest level passed, and there were no differences 
in minority and non-minority correlations. The 
School and College Abilities Test plus a test of 
abstract reasoning yielded a multiple R of .49 
when correlated with the training criterion. 

Age and Sex as Moderators

Using the Gates Reading Readiness Test and the 
Metropolitan Achievement Test for elementary 
school students, Miller and Norris (1967) found 
that younger school entrants were at a disadvan- 
tage at the start. This effect, though, disappeared 
after the first grade. The late entering group 
tended to have more achievement and psychologi- 
cal referral problems than the early and normal 
entrant group. 

Gay (1969) investigated the differential effect- 
iveness for males and females of three computer 
assisted instruction (CAI) treatments on delayed 
retention of mathematical concepts. The three 
methods of presentation were (a) “variable 
example” which depends on the subject’s pre- 
instruction retention index as measured by the 
Gay Retention Index; (b) “choice” which allows
the subject to decide on how many examples he 
needs; and (c) “fixed” which allows the subject 
three trials per mathematical concept. Fifty-three 
eighth grade subjects (27 male and 26 female) 
were randomly assigned to the treatments. The 
results indicated that (a) the females in the vari- 
able example group performed better than the 
females in the fixed and choice example groups (p 
< .05); (b) males in the choice group performed
significantly better than females in the choice 
group (p < .05); (c) males in the choice group 
performed significantly better than males in the 
variable example and fixed groups (p < .05); and 
(d) females in the variable example group
performed better than males in the variable 
example and fixed groups. Gay concluded that the 
choice method is best for males. Even though the 
males averaged three choices, they gave more trials 
to the difficult items and fewer trials to the easier 
items. The Gay Retention Index, though, seemed 
to be good for selecting the number of items for 
females. 

Cross-National Evaluation 

Husen (1969) discusses cross-national
evaluation and points out that such evaluations can be
confounded because objectives differ across
national boundaries, as do traditions, emphasis,
age levels of introduction, and opportunity. Husen
also points out that the real purpose of
cross-national evaluation is
“ . . . not to make overall comparisons between
countries - we are not engaged in an international
contest - but to obtain meaningful comprehensive
measures of both cognitive and non-cognitive
outcomes and to relate these to a comprehensive set
of input variables, including those which measure
opportunity. Thereby, provisions are made for a
fruitful multivariate analysis of how outcomes are
related to inputs (p. 343).”

Summary 

This chapter was concerned with the various
effects of learning styles and moderator variables.
First, moderator variables were defined and
discussed. Following this was a presentation of
several motivational and developmental theories
which purport to lend some insight into how 
moderator effects materialize. Additional sections 
of the chapter contained studies of race and apti- 
tude levels as moderator variables; age and sex as 
moderators; and problems of cross national evalua- 
tion. It was noted that although the moderator 
variable approach appears to possess merit, 
moderators are often elusive. Their identification 
and their desirability may be dependent on a host 
of interactive effects. Thus, although no advanced 
program will ignore moderators, one should not 
anticipate that they will provide a pat solution to 
prediction problems. 



V. CURRENT TRENDS 

Trends 

About ten years ago, Schultz and Siegel 
(1961a) perceived a trend in evaluation research 
which has since been demonstrated. They found 
that rather than investigating an overall perform- 
ance criterion, it is better to use factor analysis or 
multidimensional scaling techniques to identify 
the important components of the job or training 
task. In the past, there has been too heavy a reli- 
ance placed on the single composite criterion. This 
practice is wasteful and hides useful information.
More and more recent research has demonstrated
that one score cannot possibly represent the multi- 
dimensional and orthogonal aspects of perform- 
ance. Once the investigator arrives at multiple 
criteria, he can use a weighted sum of the 
subcriteria to arrive at a composite evaluation. 
Schultz and Siegel also stressed in the validation of 
training programs the need to determine if 
performance changes over time. If so, one might 
wish to sample performance at different times or 
determine if a longer time span is needed. 
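
The weighted-sum step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the subcriterion names, scores, and weights are hypothetical and do not come from Schultz and Siegel, and the weights are normalized so that the composite stays on the same scale as the subcriteria.

```python
def composite_score(subscores, weights):
    """Weighted sum of subcriterion scores, with weights normalized to sum to 1."""
    if set(subscores) != set(weights):
        raise ValueError("subscores and weights must cover the same subcriteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(subscores[name] * weights[name] for name in subscores) / total_weight


# Hypothetical subcriteria and weights, for illustration only.
trainee = {"troubleshooting": 70.0, "procedures": 90.0, "safety": 80.0}
weights = {"troubleshooting": 2.0, "procedures": 1.0, "safety": 1.0}
print(composite_score(trainee, weights))
```

Keeping the subscores alongside the composite preserves the information that a single overall score would hide.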

Merrifield (1965) agrees with Schultz and Siegel 
(1961a) about the need for more multivariate 
training evaluative studies. He places special 
emphasis in this regard on the special abilities 
student 



A second trend has been noted in terms of 
emphasis on cross-cultural training. Brislin (1970) 
presents a rather acid critique of most military 
cross-cultural training programs. The aim of cross- 
cultural programs, according to Brislin, is to allow 
the military to function behaviorally and effective- 
ly in a foreign environment. Most programs, 
though, do not have data on effectiveness, and the
evaluative methods used are inadequate. When 
evaluations were conducted, they were too 
dependent on verbal and written reports of the 
trainees. More data need to be collected on the 
actual overseas behavior of trainees; therefore, 
responses to attitudinal questionnaires need to be 
verified by other means. Evaluation needs to be 
conducted by researchers not associated with the
program. Also, the attitudes of foreign nationals 
should be sampled. Techniques should be available 
to assess transfer of training to the actual foreign 
situation with more replication and followup
training. 

Fiedler, Mitchell, and Triandis (1970) and 
Worchel and Mitchell (1970) have recently de- 
scribed an exciting new technique known as the 
Cultural Assimilator, which is based upon the
critical incident technique. In this technique, 
critical incidents are obtained in which the norms 
or behaviors across cultures are quite different. 
Questions are asked about the incident with 
multiple-choice answers and immediate feedback. 
A target sample from the host culture selects the 
correct multiple-choice responses. 

An experiment recently performed by the Navy 
compared two- and six-week Vietnamese language 
courses. The results demonstrated that (a) graduates
of either course met most objectives in that
they were able to acquire some vocabulary and
conversational skills; (b) students of higher aptitude
performed extremely well in the six-week
course; (c) the language laboratory produced problems
which were later rectified; (d) many graduates
thought the course was inefficient and that
they did not use all that they were taught; and (e)
low-aptitude students were only marginally 
adequate. 

Predictive Evaluation 

Richards, Holland, and Lutz (1967) found that 
non-academic accomplishment was relatively in- 
dependent of academic achievement in college. 
Non-academic accomplishment in high school 
correlated .39 with non-academic accomplishment 
in college. On the other hand, the American 
College Testing Program’s College Admissions Test 
correlated .29 with college grades, and high school
grades correlated .38 with grades in college. The
authors concluded that this study is important for
college admissions officers who are interested in
the non-academic as well as the academic potential
of the students they accept.

Ryan (1968) compared students taking a 
conventional 12th grade mathematics course with 
students taking an experimental mathematics
course to determine if prior courses in high school 
can moderate performance in college courses. The 
students were also given a mathematics achieve- 
ment test, a mathematics proficiency test, and a 
verbal ability test. The results showed that the 
mathematics achievement test correlated more 
highly with grades than did the mathematics 
proficiency test for the experimental group and 
vice versa for the conventional group. Also,
students in the experimental group performed 
significantly better than conventional students on 
mathematics achievement, but no better on 
mathematics proficiency or verbal ability. Hence, 
the achievement test probably reflects differences 
in prior instruction rather than differences in more 
general abilities. 

Goolsby, Frary, and Lasco (1968) compared 
the results of the Florida Bar Examination with
grades and aptitude test scores to determine if 
these latter measures could be used instead of part 
or all of the lengthy and expensive Bar 
examination. Only low correlations were found, 
causing the authors to conclude that no aptitude 
test scores or grades could supplant the Bar exam- 
ination. In another law predictive context (Klein & 
Evans, 1968), nine experimental measures were 
correlated with law school success for 978 law 
students across several schools. Undergraduate 
grade point average turned out to be the best 
predictor of law school grade point average in
some schools, while the Law School Admissions 
Test was the best predictor in other schools. The 
authors concluded that undergraduate achieve- 
ment can predict graduate achievement for law 
school students. In another law school situation 
(Lunneborg & Lunneborg, 1967), 557 law school 
students were surveyed in order to ascertain which 
types of undergraduate courses predict law school 
success. Verbal, accounting, and language courses 
were found to be the poorest predictors, while 
philosophy, economics, history, and business 
administration were the best. 

Kaplan, Freedman, and Kaplan (1968) wished 
to examine the utility of replacing clinical ratings 
of psychiatry students with the National Board of 
Medical Examiners Test. This latter test was found 
to correlate .44 with the ratings. These writers,
though, indicate that other types of information,
in addition to the test score, are needed because 
the written examination does not account for 
enough of the variance of the dimensions being 
investigated by the ratings. The dimensions of 
personality and psychopathology are not assessed 
by the test, but they are assessed by the ratings. 
Some further investigation of the ratings seems
warranted, though, since they are so much more
subject to bias and error than tests.
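
The point about variance can be made concrete by squaring the reported correlation. The helper below is a minimal sketch; its name is an assumption made for illustration, and the only figure taken from the text is the correlation of .44.

```python
def variance_accounted_for(r):
    """Proportion of variance in one measure shared with another (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlation between the written test and the clinical ratings, as quoted above.
print(f"{variance_accounted_for(0.44):.0%}")
```

A correlation of .44 corresponds to roughly one-fifth of the rating variance, which is consistent with the writers' call for additional types of information.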

Bergstrom (1968) related measures of school 
achievement to important job behaviors in order 
to evaluate a school curriculum. A sample of 
students (N = 150) was taken from three types of 
schools: (a) urban vocational, (b) urban compre- 
hensive, and (c) suburban comprehensive. The 
results indicated that vocational training should 
stress personal adequacy and communication
skills. The results of this study showed that (a)
those employees with specific vocational training
were more likely to be placed on a related job; (b)
students with low grades (D) in vocational courses
obtained lower job evaluation only in skill areas of
the job; (c) graduates who were poor in school
attendance tended to get significantly lower 
ratings; and (d) one-half of all trained workers 
were not placed or retained in a job they were 
trained for. 

Bale, Rickus, and Ambler (1970) wished to 
determine if undergraduate aviation training could 
be used as a predictor of graduate or replacement 
air group (RAG) instruction. The traditional 
criterion for student aviators has been successful 
completion of undergraduate flight training, but
this was felt inadequate because it did not account 
for RAG instruction. The grades in training were 
based on (a) air to air weapons, (b) air to ground 
weapons, (c) basic ground, and (d) instrument 
navigation. The multiple regression coefficient 
between training grades and success-failure in RAG 
was .43; in a cross-validation sample it was .36. 
Use of these prediction measures would have 
reduced attrition in RAG by 34 percent. The 
investigators also found that 15 tests gave a
multiple R of .43, while four tests gave a multiple
R of .38.

A final study demonstrates that OCS grades can
be used to predict officer effectiveness (Rhea,
1965). The fitness reports of 2,183 OCS graduates
were obtained after 18 months of service. A low,
but significant, correlation between each OCS variable
and fitness was obtained (average r = .22). In
general, fleet fitness reports were less predictable 
than shore fitness reports. The best predictors
were final school grades and military aptitude
which had correlations ranging from .16 to .37. 

Sensitivity Training 

Another comparatively recent innovation in- 
volves sensitivity training and its associated 
methods including T-groups, role playing, and the 
like. Bass, Thiagarajan, and Ryterband (1968) are 
severely critical of sensitivity, or T-group, training. 
They say that “. . . we still may hear complaints 
about the lack of evaluation of sensitivity training, 
yet a bibliography of at least 50 evaluative studies 
now exists. . . . why have these studies failed to 
impress social scientists? . . . A major reason may 
be because insufficient attention has been devoted 
to the purposes of the evaluation and the public 
for whom the evaluation is being prepared” (p. 
20 . 

One very controversial study by Golembiewski 
and Carrigan (1970) involved an assessment of 
change resulting from sensitivity training. The 
sample in this study was 16 commercial sales 
managers. Progress was measured by self-report on 
the 48 items of Likert’s (1967) Profile of Organ- 
izational Characteristics. The participants rated 
their organization twice, once as their conception 
of the ideal, and once as they perceived it to be in 
actuality. This was done both early in the week of 
training and four months after training. Both 
“ideal” and “now” scores increased in the interim 
in the “participative” direction, thus supporting 
the authors’ hypothesis. The authors themselves 
acknowledge the possibility of the Hawthorne 
effect or other methodological weaknesses in their 
design, but tend to minimize such possibility in 
favor of true change. Becker (1970), though, 
seems to think the study is of little value for 
several reasons: Golembiewski and Carrigan failed 
to rule out alternative explanations; they indicated 
that the Hawthorne effect cannot be rejected, yet 
they rejected it; and they failed to account for 
changes which could have occurred through 
passage of time. Becker closes with “. . . changes 
did and probably continued to occur, so it may be 
permissible to sell such a design to managements; 
but under no circumstances should one attempt to 
sell such a design as science (p. 96).” 

In another study (Cook, Hahn, & Sheppard, 
1971), 23 Navy Medical Service officers took part 
in a three and one-half day management style 
seminar; a six-month intervening period at a duty 
station followed; then a two and one-half day 
management style session was conducted. In their 
training sessions, the officers were presented with
(a) problem analysis using the “force field method”;
(b) group ranking which allowed for cross-subject
influencing; and (c) small group management style 
sessions. In the six-month intervening period, the 
subjects were urged to use their newly acquired 
techniques. The final session included discussion,
reinforcement, and feedback of management style 
data. The Management Value Index (MVI), an 
index of management style, was given at the 
beginning and end of the first session, and at the 
end of the second session. The results indicated 
course influence. The Leadership Opinion 
Questionnaire was also administered, and the 
results indicated a decrease in structure without a 
corresponding decrease in consideration. These 
results are somewhat suspect, since participants 
thought their management styles were more open 
than did their colleagues and subordinates, 
especially with regard to participation. The 
authors concluded that the much larger value 
change between the second and third administration
of the MVI suggests the need for an on-the-job
“incubation period” in order for attitudes to
change. 

Federman and Siegel (1965), in a group dy- 
namics study, isolated four performance-related 
communication factors from training teams in a 
helicopter simulator. These four factors were 
derived from a factor analysis of 14 communication
predictors shown to be related to miss
distance in antisubmarine warfare. The four
factors were (a) probabilistic structure, (b) evaluative
interchange, (c) hypothesis formulation, and
(d) leadership control. In a second study, Siegel 
and Federman (1969) cross-validated the factors 
and developed a training course based on the 
derived factors. The trained group was found to
perform better than a control (untrained) group in 
two performance tests involving enemy submarine 
detection and destruction. 

Programmed Instruction 

Lumsdaine (1970) feels that the most impor- 
tant contribution of programmed instruction is 
not improvement in instruction, but rather in the 
implicit requirement for clearly stated objectives 
in behavioral terms. 

Mager (1970a, 1970b) maintained that it is 
impossible for the instructor to apply all the
principles of learning in the classroom. This is not
because he does not want to, but because the
learning environment is prohibitive. “We still put
large groups of students in front of a single instructor
and insist that they all learn at the same rate
(p. 4).” This procedure may be convenient and
inexpensive, but it is inefficient. Programmed
learning devices and machines are held to possess
the potential for solving these problems since they
usually (a) present instruction in small steps; (b)
reinforce the student along the way; (c) help the
student proceed at his own pace; and (d) feed back 
responses into the device to modify instruction to 
fit the particular needs of the student. 

In sequential programming, learning proceeds in
very small steps, and all learners go through the 
same steps. In alternate programming, though, the 
student’s steps can be different, and they are 
governed by the student’s own responses. 

Keller (1968) indicated that the techniques of 
programmed instruction can be used in any class- 
room situation. However, according to Keller, one 
criterion that the instruction must meet is that it 
be individualized. Another requirement is that 
criterion-referenced testing be used. 

Lindvall and Cox (1969) present a Structured
Curriculum Model (SCM) for developing a programmed
instructional course. They state that one
must define specific objectives and organize them
according to difficulty or prerequisites. This organization
provides a structural sequence which is a
frame for determining the student’s present status
and for his future planning. In the SCM, the
curriculum materials must be matched to the
objectives, and one must keep in mind that
students can master the same objectives with
different kinds of material. In addition, the
student must be given a diagnostic evaluation to
place him in the proper location along the learning
continuum. The placement test should “. . .
select items which test representative objectives
along the continuum (p. 170).” Pretests are also
suggested prior to each instructional unit, because
the student may be able to cope with some of the
objectives in the unit, and not others. Evaluation
in this model is by way of “curriculum embedded
tests” and “post-unit” tests. Curriculum embedded
tests (a) measure one objective of a unit; (b)
are content-referenced; (c) are short; and (d)
enable the teachers to make decisions regarding
student advancement. Post-unit tests help the
teacher to decide whether the pupil should
progress to the next unit or should be given
remedial work.

Glaser (1967) insists that uniformity within any 
one grade level can never be achieved because of 
individual differences. This results in the need for 
programmed or computer oriented instruction.
Glaser also suggests that too much research has
been done comparing methods and not enough
research has been done on learning which
variables affect students and how. Glaser describes the
requirements for individualized instruction that 
have been set forth at the Learning Research and 
Development Center: 

1. Time limits and grade levels must be 
redesigned so the student works at his actual 
achievement level, and he progresses only 
after he has mastered the prerequisites for 
the next higher level.

2. Sequences of progression must be assigned
to each student. 

3. Progress must be continually assessed to 
modify the teaching program to fit pupil
needs. 

4. Materials should be provided to the student 
which will self-direct his learning.

5. Performance standards (feedback) should be 
provided to the student. 

6. A data processing system should be provided 
so that the teacher can take advantage of 
detailed information about each student, 
and construct an appropriate program for 
him. 

7. Pretests and posttests should be provided for 
each instructional unit. 

8. Sequential testing procedures should be 
employed for initial placement. 

Whitmore (1970c, pp. 33-34) recites four learning
principles that are contained in automated
individualized instruction that are not generally
found in traditional instruction. These learning
principles are (a) continuous participation by the
student in the instructional process; (b) providing
immediate knowledge of the results to the student
for each response that he makes; (c) recognition of
individual differences in rate of learning; and (d)
providing a high rate of success for the student 
throughout learning. 

The last principle, Whitmore says, is the most 
difficult to implement, since it requires very 
careful analysis of the material to be learned. 

McFann (1969a, 1969b) characterizes training
strategies as follows:



Strategy    Curriculum    Time        Standard

1           Fixed         Fixed       Variable
2           Fixed         Variable    Fixed or variable
3           Variable      Fixed       Variable
4           Variable      Variable    Fixed or variable



In this scheme, a fixed standard means that the 
student is to reach a minimal level, while a variable 
standard means that the student can go beyond 
the minimal level to another higher level. 

Strategy 1 is only recommended when the 
input to the course is homogeneous; if it is not, 
there will be variable output. It ignores individual 
differences and involves the additional problem of 
where to set the level of training. Strategy 2 is 
similar to most present training in the military. 
Those who fail to pass the first time are recycled 
(variable output time). One can gear the training 
to low-aptitude men, or allow the more intelligent 
men to go through the program faster. Strategy 3 
has a fixed time limit and will result in variable 
output. Strategy 4 is the most flexible and the 
most individualized, but it requires the best 
management. 

Computer Assisted Instruction (CAI) and Testing 

Computer assisted instruction represents one of
the most recent innovations in training methodology.
One of the main problems of CAI is its cost
when compared with other similar methods which
might give equivalent results (e.g., TV). Another,
more serious, objection to CAI is that it does not 
allow the student enough opportunity or freedom 
to chart his own progress (Hammel, 1969). 

Hansen, Hedl, and O’Neal (1971) feel that 
computer assisted testing will come into full 
flower this next decade. One reason given for this 
is the evidence that people answer questionnaires 
more honestly when they are presented via
computer than by traditional methods. 

Holtzman (1971) says, “In a traditional setting,
the instructor keeps a record of how well each 
student does on each achievement test for the 
course, while the periodically collected scores
from standardized normative tests are stored
centrally. When instruction is individualized, test- 
ing must be done more frequently and at different 
times for each student (pp. 547-548).” 

Seidel (1969) discusses the purpose of Project
IMPACT, which is to provide the Army with an
appropriate and efficient CAI system adaptable to
the individual trainee. Programs are to be branched
and adapted to the entry characteristics of the
trainee and his performance throughout instruction.
Some of the important decision factors
involved are (a) entry characteristics, (b) education
and background, (c) responses of trainee, (d)
response latency, (e) pattern and history of errors,
(f) relation of individual and group norms to
responses, and (g) subject matter.

Gagne (1968) disagrees with most of these
writers regarding the usefulness of computers in
testing (and instruction). He thinks that CAI puts
too much stress on the machine rather than on the
student.

Atkinson (1967) discusses three levels of CAI:

1. Simple - “fixed, linear sequence of problems
(p. 56).” There is no method of changing the
instruction as a consequence of the student’s 
responses. They are also called “drill and 
practice” systems. 

2. Complex - also called “dialogue” systems. 
They provide high-level interaction between 
student and system. The students can give 
many variations of response, can ask a 
variety of questions, and can generally 
control the sequence of learning.

3. Tutorial - are between simple and complex
with regard to the student’s interaction with 
the system. There can be decision making or 
branching, depending upon the student’s 
responses. The students can, therefore, 
follow separate paths. One of Atkinson’s 
findings was that fast learners, on a month 
by month basis, showed a continual 
improvement in rate of progress, while
medium and slow students had constant
rates of improvement.

Ferguson (1970) described how computer 
assisted criterion-referenced measurement was 
applied to an experimental school in individually 
prescribed instruction (IPI). Addition and subtraction
skills were taught in a sequence in which each
stage built onto and was required for the next
stage. After each answer, the computer made a
decision, on the basis of percentage correct and
number of problems of this type attempted,
whether to go to the next level or continue
presenting problems of the same type. Each item
was randomly selected from a population of
similar items. Direct manipulation of type I or
type II errors was possible. The type I error allows
the student to progress to the next level prior to
mastery; therefore, this is considered the most
serious type of error.
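
The kind of decision made after each answer can be sketched as follows. This is a simplified illustration and not Ferguson's actual procedure: the function name, the minimum number of items, and the advancement threshold are all assumptions chosen for the example. The text states only that the decision rested on percentage correct and number of problems attempted.

```python
def next_action(correct, attempted, min_items=5, advance_at=0.85):
    """Decide whether to advance or keep presenting items of the same type.

    min_items and advance_at are hypothetical settings. Raising either one
    makes premature advancement (the type I error described above) less
    likely, at the cost of presenting more items to students who have
    already mastered the skill.
    """
    if attempted < 0 or correct < 0 or correct > attempted:
        raise ValueError("need 0 <= correct <= attempted")
    if attempted < min_items:
        return "continue"  # too little evidence to decide either way
    if correct / attempted >= advance_at:
        return "advance"
    return "continue"


print(next_action(correct=4, attempted=4))
print(next_action(correct=9, attempted=10))
print(next_action(correct=7, attempted=10))
```

Under these assumed settings, only the second call returns "advance": the first has too few items, and the third falls below the threshold.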

Applications of Programmed Instruction

Yeager and Kissel (1969) hypothesized that the 
number of days needed to master a unit of instruction
is related to the students’ “initial entering
state.” The entering state variables were (a) unit
pretest score which, when subtracted from 100,
gives the distance or amount to be learned; (b)
number of types of pretest skills on which the
student failed to show mastery (IPI only concentrates
on these); (c) intelligence; and (d) age which
reflects student maturity. The entering state
variables used in this study, therefore, were pretest
scores, number of skills to be mastered, I.Q., age,
and total units mastered previously. The results
demonstrated that pretest score, numbers of skills
to be mastered, and age were the best predictors,
while I.Q. score had the least influence. The
multiple correlation coefficients for different
types of materials ranged from .65 to .84 (N = 40).

Atkinson (1967) found that students in an
experimental CAI reading program performed
significantly better in all aspects of reading (e.g.,
pronunciation, vocabulary, recognition) than did
students in conventional (control) reading classes. 
The control group received CAI mathematics 
instruction, but not CAI reading instruction. 

K. Johnson (1968) examined the results of
three different methods of teaching military communications
courses. The three methods used were
conventional, programmed instructional booklets,
and partially individualized (first week
conventional followed by self-paced). The results
showed that the self-paced (partially individualized)
instruction produced a 16 percent
reduction in course length, while the programmed
instruction produced a 9 percent decrease in
course length. These reductions were accomplished
without loss of skill.

Geisert (1970) wished to examine the contribution
of format and feedback to learning. Two
groups of Army National Guardsmen (N = 44) were
used as subjects. The concepts to be learned in the
experimental group were arranged hierarchically
(mapped) to ease positive transfer to the next
highest level. Fifteen dependent variables were
used including reading time on booklet, test
scores, time spent reading instructions, time spent
on practice, and time spent on problem solving
instructions. The results demonstrated no significant
differences between the hierarchical group
and the traditional group, except that the former
group tended to do all things slightly faster.
Similar results were obtained for the feedback-no
feedback group. With regard to certain attitude
scales which were administered, it was shown that
subjects preferred to learn from the mapped-feedback
system over the traditional system. The
subjects also thought that a computer assisted
screen was an effective way to present material
when compared to booklet material, although
neither was shown to be more or less effective
than the other.







A novel and interesting approach to self-paced
instruction was recently developed by Sheppard
and MacDermot (1970). Subjects were 203
students enrolled in an experimental course and 98
students enrolled in a traditional course. The
students in the experimental group were to study
one of 36 sections of a psychology book. After
study, the students were asked to explain the
lesson in detail to another student who had
already completed the work, or to an instructor. If
the learner failed, he would repeat the lesson until
mastery was achieved. Completion of all 36 interviews
earned a grade of A, 75 percent a grade of B,
50 percent a C, and 33 percent a D. The control
group was as comparable as possible, since the
students spoke in small groups and used the same
book. At course completion, both groups were
given 100 multiple-choice questions and five essay
questions. The control group was told that the
final examination contributed 50 percent of their
grade, while the experimental group was told that
the final examination did not count. In addition,
the control group was informed that they had to
finish the entire test. These last two factors should
produce a bias in favor of the control group. The
mean for the experimental group on the multiple-choice
test was 73.1, and for the control group it
was 66.8 (p < .01). On the essay questions, the
experimental group scored 17.4, and the control
group 13.9 (p < .01). Also, composite student
satisfaction, as measured by an attitude scale, was
higher for the experimental group (p < .01). Of
those queried, 94 percent thought the interview
method was more effective than the lecture
method.
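
The grading rule for the experimental group can be written out as a small function. The sketch below assumes that the stated percentages act as minimum cutoffs, so that any completion level between two cutoffs earns the lower grade; the text gives only the four reference points and does not say what happened below 33 percent, so the function returns None in that case. The function name is hypothetical.

```python
TOTAL_INTERVIEWS = 36  # number of sections in the course, as stated above


def interview_grade(passed):
    """Map the number of interviews passed to a letter grade."""
    if not 0 <= passed <= TOTAL_INTERVIEWS:
        raise ValueError("passed must be between 0 and 36")
    if passed == TOTAL_INTERVIEWS:
        return "A"
    # Integer comparison avoids rounding problems at the cutoffs.
    for percent, grade in ((75, "B"), (50, "C"), (33, "D")):
        if passed * 100 >= percent * TOTAL_INTERVIEWS:
            return grade
    return None  # outcome below 33 percent is not given in the text


for n in (36, 27, 18, 12, 11):
    print(n, interview_grade(n))
```

Under this reading, 27 interviews are needed for a B, 18 for a C, and 12 for a D.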

Siegel and Fischl (1965) were concerned with
pre-emergency training which prepares the public
for a disaster or critical situation. They employed
a technique known as “adjunct auto-instruction,”
which is meant to supplement other training techniques
or points that need emphasis and stress.
Adjunct auto-instruction tends to keep the learner
active, and gives him feedback. The subjects were
four matched groups (N = 9 to 13 per group) of
semi-skilled, adult, employed women receiving
attack survival material. The four experimental
conditions provided that the subjects (a) receive
material by phone, (b) read material in print, (c)
read material in print and receive adjunct
auto-instruction, or (d) receive material by telephone
and receive adjunct auto-instruction. The
non-adjunct groups were presented the material
twice to equate for exposure time. A final examination
administered at the end of training demonstrated
that both adjunct types were significantly
superior in promoting learning gains over non-adjunct
materials (p < .01).



A CAI data management system was developed
by Ford and Slough (1970) for an electronics
course module. The course was tried out and
revised three times using a total of 52 subjects.
Next, the module was compared with normal classroom
training using 51 CAI subjects and 200
traditional subjects. Afterwards, both groups took
a standard school examination and a
supplementary test. For all ability levels, CAI
produced higher achievement than traditional
classroom instruction. In addition, CAI produced
time savings of 33 to 44 percent.

Showel, Taylor, and Hood (1966) constructed a 
leadership training package including tapes, film- 
strips, and workbooks. This training package was 
used for an experimental group while a control
group received traditional instruction (i.e.,
lectures). The subjects were matched on the
General Technical Aptitude area of the Army
Classification Battery and randomly assigned to
control and experimental groups. An essay examination
was used to test achievement immediately
after training and 10 weeks after training. The
results demonstrated that the leadership automated
package produced greater gain and was less
costly than the conventional package.

Steadman, Bilinski, Coady, and Steinemann 
(1969) were interested in investigating alternate 
methods of training low-aptitude Naval personnel. 
Of 31 subjects, half were taught by instructor and 
half by programmed text. Achievement was 
measured by three quizzes and a practical perform- 
ance test. Upon the termination of training, only 
eight subjects reached an adequate proficiency 
level in terms of the final practical performance 
test. These writers concluded that, in general, the 
course was not appropriate for low-aptitude 
personnel. 

Programmer Characteristics

The selection of programmers for programmed 
learning is just as important as the selection of 
materials. Some of the characteristics of successful 
programmers are (a) “relatively high intelligence,”
(b) “interests in the area,” (c) “attitudes favorable
to the area and favorable to achieving the goal,”
(d) “compulsivity,” and (e) “functional level of
motivation (Melching, 1970, pp. 71-72).”

Television Instruction 

TV instruction, although not used in the same
way as CAI, is much less costly. TV instruction
seems advantageous when instructor shortages
exist, rapid dissemination of information is
required, and student communication is not
necessary. This type of instruction is disadvantageous
when applied lessons and student
communication are needed.

Basic Education

Standlee and Hooprich (1962) feel that most
tests of the effects of adult reading courses lack
sophistication. Most experimenters measure reading
ability before and after training, but fail to
control for such factors as initial reading level,
intelligence, motivation, equivalence of forms, test
practice effects, set, test ceiling effects, change,
regression effects, timed tests, type of test score, 
criterion choice, and differences between control 
and experimental subjects. These authors, after 
reviewing several sound studies, arrived at the 
following conclusions: 

1. Reading speed gains are real. What happens 
to comprehension and vocabulary is un- 
certain, since they are confounded with 
speed. Eye movements usually improve. 

2. Reading speed gains are retained. Generally, 
60 to 70 percent was retained after six 
months to two years. 

3. Reading instruction gains transfer to
academic achievement, academic aptitude,
clerical ability, and temperament. These
gains may not be due to reading instruction,
though, because these courses may also
teach study skills, or give counselling and
therapy which can also be associated with
improvement.

4. No methods, materials, or programs of
instruction were shown to be superior to any 
other. Also, no individual differences in 
personality, intelligence, or occupation were 
associated with reading skill gains. 

5. Reading improvement courses are helpful for 
those whose jobs depend upon reading. In 
this case, increased speed is enough justification
for taking the course.

Steinemann, Hooprich, Archibald, and Van
Matre (1971) investigated the effects of a
“wordsmanship” course given to 176 low-aptitude
Naval personnel. These subjects characteristically
have low verbal aptitude and unfavorable language
attitudes which cause a bias against learning.
Nevertheless, these investigators found that “the
trainees substantially improved their knowledge
and proficiency in each of the sub-course areas of
wordsmanship, and most students reported a more
favorable attitude toward words and a desire for
self improvement of verbal skills.”



Mollenkopf (1969) gave different 100-hour
basic skills training courses (computation, spelling,
filing, reasoning, paragraph meaning) to three
different groups (office workers, laboratory tech- 
nicians, and production employees). Most of the 
participants made sizable gains and most pretest 
and posttest score differences were significant, 
although regression and ceiling effects may have 
been involved. In almost all of the tests, at least 80 
percent of the students made gains. 

Hooprich and Steinemann (1966) indicated 
that there is “a general trend toward performance-oriented
training courses in which technical mathematics
and unnecessary electronics theory are
minimized. . . . Increasing investigative attention
devoted to performance evaluation problems is a
reflection of the growing recognition of performance
assessment as a critical factor in the final
evaluation of total training effectiveness (pp. 
17-18).” 

Kent, Bishop, Byrnes, Frankel, and Herzog 
(1971a, 1971b) attempted to identify the Adult 
Basic Education (ABE) courses that were successful
in job related settings (e.g., obtaining job,
promotions, entering training). Information was
collected on 80 programs whose features or
aspects were typed. Fifteen programs containing
all features of interest were selected for the study. 
Checklist interviews were used to obtain data. The 
findings indicated that (a) there is a great need for 
ABE in basic abilities which vary from student to 
student and job market to job market; (b) the 
need for job related ABE is not being met in that 
the programs do not perform enough job place- 
ment, skill training, post instructional followup of 
students, self-evaluation, and improvement of 
materials; (c) theory, administration, and money 
are inadequate; (d) ABE programs should cooperate
among themselves and with large centers
for research; and (e) organizations should be 
invited to bid in order to conduct ABE job related 
programs. 

Training Devices 

Edgerton and Fryer (1950) have prepared a 
system for preliminary evaluation of a training aid. 
This system has the following features: (a) it is
uniform and consistent; (b) it is brief; (c) it needs
no special skills to administer; (d) it improves
validity of technical judgments; (e) it shows
advantages and defects of the training aid; (f) it
provides for an overall judgment; and (g) it yields
information from which an experimental evaluation
of the training aid can be constructed.

Richardson, Bellows, Henry & Co. (1962)
developed three evaluation forms for new training
devices. These forms were constructed from litera- 
ture reviews, descriptions of Navy devices, descrip- 
tions of industrial devices, and evaluation reports. 
These questionnaires were validated using the 
nomination technique in which instructors and 
training officers nominated devices as “best” or 
“worst.” The resultant validity and reliability of 
the three methods proved adequate enough for
use. 

Siegel and Federman (1969) used Guilford’s 
(1967) structure-of-intellect (SI) model to help
derive the most appropriate aids and devices for 
training the tactical coordinator in the P-3c air- 
craft. Guilford’s model allows the description of 
the mental tasks an operator performs in terms of 
intellectual load. These descriptions are quantita- 
tively derived, and the needed aids and devices can 
be based upon them. The operations in the SI 
model specify the type of aids or devices for train- 
ing. The contents in the SI model tell the subject 
matter of the aids or devices. Finally the SI 
products tell what is to be learned. The authors 
conclude that this technique defines training 
requirements and closes “. . . the loop between 
job analysis and the aid/device derivation.” 

Instructor Evaluation 

A. Harris (1969) has found “. . . differences
among teachers far more important than differences
between methods and materials in influencing
the reading achievement of children (p.
204).” The main criterion of teacher effectiveness
should be pupil gain on standardized tests. The
correlations between teacher ratings and tests are
not large enough to support the use of ratings.

Bittner (1968) recently executed an interesting 
analysis of student evaluations of instructors. 
Subjective comments were collected from students 
on oral communication factors. These statements 
were content analyzed by six speech teachers 
(interrater reliability = .73). Five categories were 
derived: (a) rate of speaking, (b) volume, tone, and 
pitch, (c) use of audio-visual aids, (d) use of
discussion, and (e) organization of lecture. The 
largest number of comments concerned organiza- 
tion of lecture, while volume, tone, and pitch had 
the smallest number of comments. The most
negative comments concerned volume, tone, and
pitch, and the most positive concerned use of
audio-visual aids. Rate of speaking was also somewhat
negatively appraised. In addition, more
negative comments were associated with graduate
teaching assistants than with any other category.



Veldman and Peck (1969) wished to determine 
the influence on pupil evaluations of student 
teachers. These authors felt that the most reliable
description of teacher behavior comes from the
students. The Pupil Observation Survey (POSR)
consisted of 38 items grouped into 10 scales.
POSR data were collected on 554 student teachers
at the University of Texas. The data were then
factor analyzed, yielding five factors: (a) friendly
and cheerful, (b) knowledgeable and poised, (c)
lively and interested, (d) firm control, and (e)
non-directive. Analysis of covariance was used to 
determine if five characteristics (grade in student 
teaching, grade of class, subject area, socio- 
economic status, level of school, and sex of 
teacher) had any effects. The results demonstrated 
that (a) all factors increased with increased student 
teaching grade; (b) only friendly-cheerful and 
lively-interested were positively and inversely 
related to grade level of students; (c) all factors 
except knowledgeable-poised were related to 
subject matter area; (d) as social class decreased, 
lively-interested increased, firm control decreased, 
and non-directive increased; and (e) females were 
rated higher on friendly-cheerful than males.

Hiller, Fisher, and Kaess (1969) performed a 
computer investigation of the verbal characteristics 
of effective classroom lecturing. Fifty-five
15-minute lectures producing 105,000 words were
analyzed for verbal fluency, optimal information 
amount, knowledge structure cues, interest, and 
vagueness. The findings demonstrated that vague- 
ness in the lecture was most important. Vagueness 
is defined as “. . . the state of mind of a per- 
former who does not sufficiently command the 
facts or the understanding required for maximally 
effective communication (p. 670).” 

Military Research 

Electronics Technicians. Applied Psychological
Services (1971) recently developed a quick course
of passive sonar training for system technicians. 
First, the training requirements were developed, 
followed by a course which was balanced between 
practical work and lecture presentation. Sonar 
technicians were given the course in one week. 
After finishing the course, they each completed a 
13-item questionnaire. The mean value on a four- 
point scale for all 13 questions was 3.4. High
values were concerned with the amount the 
student learned in the course. The authors con- 
cluded that this project was extremely useful, 
since it demonstrated that quickly but systematically
developed courses could be useful.

Bilinski, Saylor, and Standlee (1969) used an
analysis of on-the-job feedback to help increase 
training effectiveness. Electronics technician grad- 
uates were examined in regard to their ability to 
maintain a radar system. First, a job analysis was 
performed; then a structured interview was con- 
structed from the job analysis to obtain 
information from a fleet sample of electronics
technicians. This procedure elucidated difficult 
maintenance and problem areas for feedback into 
the training school. 

Steinemann, Coady, Harrigan, and Matlock 
(1968) wanted to evaluate the job capabilities and 
fleet utilization of 64 four-year obligor graduates 
of electronics technician phase A-1 training. 
Performance measures and objective ratings were 
collected. Most electronics technicians were found 
to be more or less adequate. However, training 
limitations made on-the-job training and initial 
supervision necessary for all but the most routine 
tasks. Troubleshooting was found to be the 
weakest area. It was recommended that four-year 
obligors be given more training, or only be allowed 
to assist in fleet maintenance tasks. Steadman and 
Harrigan (1971) obtained similar results with six- 
year obligor data systems technicians. They 
suggest deemphasis of irrelevant electronics theory 
in favor of more practical training. 

Helicopter Training. The studies discussed in
this section were reviewed in a previous chapter of 
this report. The emphasis then was on dependent 
measures; now it is on evaluation. 

Greer, Smith, and Hatfield (1967) wished to 
control for checkpilot personal bias in rating 
rotary wing students. The resultant ratings 
reflected the checkpilot’s own standards rather
than the student’s flying skill. The training pro- 
gram was analyzed into maneuver components. 
Proficiency scales and instrument observation were 
substituted for the checkpilot’s own method. The 
Pilot Performance Description Record (PPDR) was 
constructed to reflect the most critical aspects of 
each maneuver. The PPDR was administered to 50 
advanced and 50 intermediate students. The 
results demonstrated that (a) reliability of flight 
proficiency evaluation improved; (b) the PPDR
recorded specific student deficiencies; (c) checkpilots
who were trained in PPDR were more
consistent in their evaluation than checkpilots who 
were only oriented in PPDR; and (d) checkpilot 
training is necessary in the use of the PPDR. 

Another approach, used by Greer (1968), to 
compensate for the variations in checkpilot standards
involves grouping checkpilots with similar
standards. Checkpilots were asked to complete an
11-point rating form, and those who agreed at .90
or better were paired together. In their actual
evaluation duties, they correlated .65. It seems as
though the earlier approach (Greer et al., 1967) is
more fruitful, since their checkpilots became
better, less biased observers of behavior, while in 
this latter study (Greer, 1968), the checkpilots’ 
bias is still allowed to operate. 
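
The pairing rule described for Greer (1968) can be illustrated with a short sketch. It assumes that "agreed at .90 or better" refers to a Pearson correlation between two checkpilots' ratings of the same items, which the text does not state. The pilot labels, the ratings, and the function names are hypothetical and serve only to show the mechanics.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def agreeing_pairs(ratings, threshold=0.90):
    """Return every pair of raters whose ratings correlate at or above threshold."""
    return [
        (p, q)
        for p, q in combinations(sorted(ratings), 2)
        if pearson(ratings[p], ratings[q]) >= threshold
    ]


# Hypothetical 11-point ratings (0-10) given by three checkpilots to the same five items.
ratings = {
    "pilot_a": [2, 5, 7, 9, 10],
    "pilot_b": [1, 4, 7, 8, 10],
    "pilot_c": [9, 3, 8, 2, 6],
}

pairs = agreeing_pairs(ratings)
print(pairs)
```

Under these made-up ratings only the first two checkpilots would be paired. As the review notes, such pairing groups similar standards but does not remove the bias itself.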

Duffy (1968) and his associates (Duffy & 
Anderson, 1968; Duffy & Jolley, 1968) produced
an objective and detailed scoring record. Students 
were scored on checkrides during and after train- 
ing to yield a class percentage error. This 
procedure allows for class comparisons, grade 
comparisons, and instructor comparisons. If partic- 
ular errors are identified among the students of 
one instructor, the instructor is given additional 
instructor training. Finally, if one checkpilot is 
more strict than the others, he is given counsel to 
make his observations more conforming. 

Officer Training. Glickman and Vallance (1967)
wished to find those aspects of the OCS cur- 
riculum which were most and least relevant to the 
job requirements of ensigns on destroyers. One- 
thousand critical incidents were collected and 
classified as to “taught” and “not taught.” Checklists
containing 100 of the resultant items were
sent to 30 to 50 high-level officers. They were
required to judge the length of time in service after
which the new officer should be able to handle the
incident. The sooner an ensign was expected to
cope with an incident, the more important that it
be learned in OCS. Human relations, personnel
administration, and leadership skills were found to 
be more important in this context than technical 
skills. 

Morsh (1969) administered an officer manage- 
ment inventory to 10,242 Air Force officers who 
ranged in rank from lieutenant through colonel. 
The management inventory consists of a listing of 
tasks and duties, and a listing of military education 
topics. The officers rated, on an eight-point scale, 
the extent to which each task is a part of their job, 
and the extent to which each educational topic is
useful in their job. Forty-three managerial types 
were derived from this analysis, although there was 
much overlap across types. The extent of 
managerial responsibility was directly related to 
officer grade. Also identified were training needs 
in leadership, communication, creative and logical 
thinking, problem solving, officer ethics, discipline 
and morale, and military customs and security. 
Other training topics were found to be of little 
use.

Task Analytic Methods. Stewart (1970) used
task analysis to evaluate training effectiveness.
Military task data were collected and analyzed to
determine the extent to which training is job oriented.
Stewart found that, in terms of cost, overtraining
was as significant a problem as undertraining.

Siegel and Schultz (1961) and Siegel, Schultz, 
and Federman (1961) designed a system of training
evaluation using matrix concepts. Essentially,
training is acceptable if the average trainee
performs with proficiency on a highly important
task. Training is poor if the average worker 
performs poorly on a very important task and is 
very proficient on a task of low importance. This 
technique can yield a training index, an overtrain- 
ing index, and an undertraining index for the 
entire training program. In addition, this method 
points to deficiencies in the program which need 
emphasis and parts of the program which need 
deemphasis. Schultz and Siegel (1962a, 1962b) 
applied the technique to posttraining performance 
of four Naval ratings. The results demonstrated 
that none of the groups were undertrained, while 
two of the groups seemed overtrained. 
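
The importance-by-proficiency logic described above can be sketched as follows. This is only an illustration under stated assumptions: the review does not give the formulas used by Siegel and Schultz, so the 1-to-5 rating scales, the cutoff values, the task names, and the definition of each index as a simple proportion of tasks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    importance: float   # hypothetical 1-5 rating of task importance
    proficiency: float  # hypothetical 1-5 mean trainee proficiency


def classify(task, high=4.0, low=2.0):
    """Place a task in one cell of an importance-by-proficiency matrix."""
    if task.importance >= high and task.proficiency <= low:
        return "undertrained"   # important task, poor performance
    if task.importance <= low and task.proficiency >= high:
        return "overtrained"    # unimportant task, high performance
    return "acceptable"


def training_indices(tasks):
    """Proportion of tasks in each cell (an assumed, simplified index)."""
    counts = {"undertrained": 0, "overtrained": 0, "acceptable": 0}
    for task in tasks:
        counts[classify(task)] += 1
    return {label: count / len(tasks) for label, count in counts.items()}


# Hypothetical task ratings for one training program.
tasks = [
    Task("troubleshoot receiver", 5.0, 1.5),
    Task("polish equipment case", 1.0, 4.5),
    Task("align transmitter", 4.5, 4.0),
    Task("log maintenance", 3.0, 3.0),
]

indices = training_indices(tasks)
print(indices)
```

Tasks flagged as undertrained would point to parts of the program needing emphasis, and tasks flagged as overtrained to parts that could be deemphasized.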

Aircraft Recognition. Whitmore, Cox, and Friel
(1968) performed a study concerned with ground 
to air recognition training. The original training 
program for this aspect of aircraft recognition was
thought to be inadequate. First, ground to air
recognition slides were selected (16 Soviet and 
American jet fighter/attack aircraft). The paired- 
comparison method was employed to train in the 
discrimination. Eight-second exposures were given 
during training while five-second exposures were 
selected for the test. The results demonstrated that 
(a) 16 sessions were needed to achieve a 95 per- 
cent average recognition level; (b) class average on 
degraded images was 61 percent; (c) degraded 
images correlated .82 with the training achieve- 
ment tests, indicating that the skill learned during 
training was not specific to the training slides; and 
(d) trainees maintained approximately the same 
position in class from achievement test to achieve- 
ment test. 

Summary

This chapter began with a discussion of some 
generally recognized trends. The most important 
trend seemed to be increased recognition of the 
multidimensionality of criterion measures. Next, 
there was a discussion of training needs and 
deficiencies followed by a very critical discussion 
of trends in cross-cultural training. This was
followed by a presentation of some studies concerned
with achievement measures as predictors of
later success. Then there were reviews of studies
involving sensitivity training, programmed
instruction, CAI instruction, basic education,
training and evaluation, and instructor evaluation.
The final portion of this chapter was devoted to
recent military research including electronics technician
training, helicopter training, officer training,
task analytic methods of evaluation, and aircraft 
recognition. 



VI. COMPARATIVE EVALUATION 

This chapter is divided into two parts. The first 
section involves comparative evaluation studies of 
non-low-aptitude men, while the second section 
focuses on low-aptitude evaluations. Generally, the 
studies reported here involve a relative comparison 
between two or more methods of instruction or
training. In many cases, a new training method is 
compared with a standard method to determine if 
the latter should be replaced by the former. 

Comparative Studies of Subjects Within Average 
or Higher Aptitude Ranges 

Steinemann, Coady, Harrigan, Matlock, and 
Steadman (1969) compared six-year obligor 
electronics technicians with four-year obligors who 
are given less training. Six-year obligors were 
found to perform better on troubleshooting tests, 
test equipment examinations, written theory, and 
equipment tests. Questionnaire data on school 
limitations in troubleshooting were verified by the 
relative weakness found in this area as indicated by 
performance tests. 

Hurlock (1971) grouped electronics technician 
training objectives into four short CAI lessons. 
Fifty randomly selected students were given CAI, 
and 180 were given traditional training. All 
subjects took the same final examination. The 
results demonstrated that overall achievement was 
10 percent higher for CAI students. In addition, 
CAI instruction reduced training time 48.5 percent 
(17 hours to 8 3/4 hours). 
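
The reported time saving can be checked from the two figures given. The sketch below uses only the 17-hour and 8 3/4-hour values from the text; the function and variable names are illustrative.

```python
def percent_reduction(before_hours: float, after_hours: float) -> float:
    """Return the percentage by which training time fell."""
    return (before_hours - after_hours) / before_hours * 100.0


traditional_hours = 17.0  # traditional training time reported in the text
cai_hours = 8.75          # CAI training time reported in the text (8 3/4 hours)

reduction = percent_reduction(traditional_hours, cai_hours)
print(f"{reduction:.1f} percent")  # prints "48.5 percent"
```

The result agrees with the 48.5 percent figure cited for Hurlock (1971).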

Askren and Valentine (1970) were interested in 
the differences between Air Force instructors with 
job experience and without job experience in 
teaching a specialty area. The criteria used were 
student grades, student critiques, and supervisory 
evaluation. Seventy instructors and 585 students 
were used as subjects. Their conclusions were that 
(a) there were no significant differences in overall 
course grades across instructor type in a 
pneudraulics course; (b) there was an interaction 
for an environmental system course such that
grades of students from field-experienced teachers
increased from the beginning to the end of the
course and decreased for non-field-experienced
teachers from the beginning to the end of the
course; (c) there were no significant differences in
the student critiques; (d) field-experienced
teachers were given an average supervisory rating
of 3.22 (on a five-point scale) while non-field-experienced
instructors received an average rating
of 3.06; (e) a small number of the rating
categories (knowledge of subject, student interest,
and student participation) caused most of the
difference; and (f) the job-experienced instructors
were better at teaching theory. These investigators
concluded that there is little practical difference in
instructor type, but, if a shortage of field-experienced
instructors exists, field-experienced
persons should be used in practical, shop related
courses.

Tallmadge (1968) attempted to study the interactions
between trainee characteristics (e.g., aptitudes
and interests) and training methods. A one-week
segment of Navy radarman school students
was used as a setting for this experiment. In
addition, a 32-item criterion test was developed.
Three experimental conditions were involved: (a)
subjects taught using rote memorization methods;
(b) subjects taught problem solving, principles, and
rationale approach; and (c) a standard approach,
which is a mixture of the other two methods. The 16
aptitude and interest measures did not interact 
with the three training methods as hypothesized. 
Perhaps the wrong training methods or the wrong 
aptitude and interest measures were used. It is also 
possible that other interactions existed which 
obscured the hypothesized interactions. Subjects 
in the rationale and understanding condition 
performed significantly better on the criterion test 
than the others, thus supporting the contention 
that this approach results in a hierarchically higher 
type of learning with better retention. 

McFann, Buchanan, Lyons, Ward, and Waits 
(1958) compared a conventional Known Distance 
marksmanship training course with a new Trainfire 
I rifle marksmanship course. After four weeks of 
training, both groups received target detection and 
the Trainfire I marksmanship proficiency tests, as 
well as the conventional Known Distance test. The 
results demonstrated that Trainfire I training 
produced (a) a greater number of detected targets;
(b) a shorter latency of target detection; (c) more
target hits; (d) a higher percentage of men qualifying
(the sum of marksman, sharpshooter, or
expert); and (e) fewer qualifying as expert on the
Known Distance range. 



Olmstead (1968) compared Quick Kill Basic
Rifle Marksmanship training (QKBRM) with traditional
Basic Rifle Marksmanship training (BRM).
QKBRM involves training the student to engage a
target without aligning the sights of the weapon. 
Two experimental groups received QKBRM in 
their training and one control group received tradi- 
tional BRM training (total N = 824). One of the 
experimental groups received a pre-training and a 
post-training questionnaire, and the other experimental
group received only a post-training
questionnaire. Control and experimental groups 
were compared on gains in confidence, attitude 
toward BRM, and drill sergeant attitudes toward 
QKBRM. Findings indicated an increase in con- 
fidence in both groups with QKBRM trainees 
gaining more confidence than traditional BRM 
trainees. The drill sergeant’s attitude, though, was 
only somewhat favorable. One undeniable method- 
ological weakness in this study is that the authors 
did not report any proficiency or marksmanship 
data across experimental groups. 

Another study in this group concerns the 
effectiveness of an apparatus used as a simulator in 
driver training. The simulator-trained group was 
found to be superior in this experiment to the 
group trained on a projection-type driver trainer 
(Jeantheau & Anderson, 1966). 

Caro and Isley (1966) used four groups of 33 
subjects each in a study of Naval helicopter flight 
training. Groups A and B flew a training device 
3.17 and 7.13 hours, respectively. Two control 
groups, C and C^, received no device training. The 
Fisher exact probability test demonstrated that 
both device groups had fewer eliminations from 
training than did both control groups (10 percent
to 30 percent at p<.006). In addition, the control
groups had more unsatisfactory and below-average
grades than did the two experimental groups.
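
For readers unfamiliar with the Fisher exact probability test, a minimal sketch follows. The elimination counts are hypothetical: they are obtained by assuming the two device groups and the two control groups were each pooled (66 subjects per pool) and by rounding the reported 10 and 30 percent elimination rates to whole subjects. The study's actual cell counts are not given in the review, so the resulting probability should not be expected to reproduce the published value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: rows are device-trained and control pools,
# columns are eliminated and retained.
device_eliminated, device_retained = 7, 59     # about 10 percent of 66
control_eliminated, control_retained = 20, 46  # about 30 percent of 66

p_value = fisher_exact_two_sided(
    device_eliminated, device_retained, control_eliminated, control_retained
)
print(f"two-sided p = {p_value:.4f}")
```

The test is suited to small samples such as these because it computes the probability directly from the fixed table margins rather than relying on a large-sample approximation.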

In another study, Isley, Caro, and Jolley (1968) 
examined the advantage of a modified fixed wing 
device as a synthetic trainer for rotary wing proce- 
dures and aircraft control. Three groups of trainees 
were used each with 0, 10, and 20 hours, respec- 
tively, of synthetic training time. The experi- 
menters found no difference in time to complete 
the course or in helicopter flight performance. 

Isley (1968) and Isley and Caro (1969), in
similar studies, examined the effects of a fixed 
wing rotary aircraft instrument trainer. Warrant 
officer candidates were divided into three 
treatments with 0, 10, and 20 hours, respectively, 
of synthetic training. The criteria used were deviations
from regulation on 10 flight parameters in a
checkride. The results dramatically favored the
group with no synthetic training in that they
performed as well or better than the 20-hour
group. The authors of this study seriously
questioned use of the simulator.

Rhodes (1950) attempted to compare a new 
and an old ejection-seat trainer. The new trainer 
was more mobile, not as high, and more realistic in
that it had a dummy cockpit. Training consisted of
film, a lecture, and an ejection. Attitude was
measured in both an “old” and a “new” group 
before and after ejection on each device. A group 
of reserve pilots was used as a control. No differ- 
ences were found across groups; therefore, each is 
regarded as equally effective. Attitude did improve 
for both groups combined with reference to gain
scores (p<.01). The author concluded that, regard- 
less of device, overall ejection-seat training tends 
to increase confidence and decrease fear of this 
bailout method. 

Gabriel and Burrows (1968) performed a study 
of pilot time-sharing performance. Time-sharing is 
concerned with alternating attention between two 
or more sources of information. Specifically, the 
pilot uses his instrument panel so much that he has 
little time to devote to outside scanning of the 
environment. The training task in this study was to 
improve the perception of midair threats of 
collision. The results suggested that use of the 
simulator can increase efficiency of pilot time-sharing
between intra- and extra-cockpit stimuli.

Ward, Fooks, Kern, and McDonald (1970) 
wished to determine if the Basic Combat Training 
(BCT) and the Advanced Infantry Training (AIT) 
courses could be integrated for a sample of conscientious
objectors in medical corpsman training.
The content of the training courses currently used 
was catalogued. A job activities questionnaire was 
developed reflecting emergency medical care and 
secondary and recuperative treatment. The four 
types of tasks included in the training were 
company aidman, evacuation medic, aid-station 
dispensary medic, and ward nursing care medic.
The criteria for selecting these groupings were
availability of supervision, frequency, and opportunity
for on-the-job training. In the resultant
16-week course, practical work was emphasized 
and lecture was deemphasized. A large amount of 
TV instruction was used for 80 experimental 
students. For 80 other students, traditional train- 
ing was involved. Combat proficiency, aidman 
proficiency, and attitude questionnaires were 
administered to all the trainees. In addition, an
evaluation questionnaire was given to the instructors.
The results of this effort demonstrated that
(a) on military proficiency tests, both
experimental and control groups performed
equally well; (b) control subjects performed better
on the Basic Combat Proficiency Test; (c) experimental
subjects did better on physical skills used
by medical corpsmen; (d) there were no significant
differences in written knowledge tests; (e) experimental
subjects performed better on medical
performance tests; (f) experimental subjects had a
higher opinion of the Army and its training than 
did standard subjects; and (g) instructors thought 
the experimental program was superior. 

Judisch, Cooper, Francis, and Ray (1968) in- 
vestigated the present curricula and job require- 
ments of graduating medical corpsmen from two 
schools. They found that on knowledge tests San 
Diego students performed better on anatomy, 
physiology, first aid, and nuclear biological and 
chemical warfare. On the other hand, Great Lakes
students were superior in patient care. A perform- 
ance decrement was found over time such that, 24 
weeks post-training, graduates were 10 percent 
worse than current students, and graduates of over 
24 weeks were 16 percent worse. Also, a survey 
was performed to determine how much and where 
prior knowledge and information were acquired. 
Students reported gaining prior knowledge from 
lectures, films, readings, practical experience, and 
other visual aids. In all, though, this knowledge 
accounted for only 10 percent of the school 
knowledge. It was also found that San Diego 
students learned more from lectures than did 
Great Lakes students, and that Great Lakes 
students learned more from reality than did San 
Diego students. As a consequence of these results, 
the authors recommended revision in the cur- 
riculum. 

Richlin, Federman, and Siegel (1958) compared 
general Naval technical training with a more 
specialized type of training under the Selective 
Emergency Service Rate Program (SESR). Each 
Naval rating in this program is subdivided and 
given a more specialized, shorter type of training. 
After training the men are utilized mostly in tasks 
for which they were trained. A Technical Behavior 
Checklist (TBCL) was developed as a criterion of 
performance for aviation machinist mates in the 
SESR program. Items for the TBCL were derived 
from tasks selected for their importance to the 
job, time consumed, and variability. The results of 
this study demonstrated that graduates of the
SESR program were equal to or better than the
graduates of the more generalized program. Several 
other SESR studies were performed. In these 
studies it was demonstrated that (a) SESR trained 
air controllers performed as well as generally 
trained air controllers except in tower operations 
(Siegel, Richlin, & Federman, 1958); (b) SESR
trained parachute riggers performed as well or 
better than generally trained parachute riggers 
(Siegel, Richlin, & Federman, 1958); and (c) SESR 
trained avionics technicians performed as well or 
better than pre-SESR trained avionics technicians 
(Richlin, Siegel, & Schultz, 1960).

Siegel, Federman, and Richlin (1959) adminis- 
tered a series of interviews to officers and petty 
officers in order to assess their opinion of the 
SESR program. One problem identified was the 
difficulty of assigning tasks to a more specialized 
man. Some supervisors felt SESR trained graduates 
achieved competence earlier, but that the more 
generally trained men were more useful. 

CAI and TV Instruction. Gallagher (1970)
attempted to investigate relevant learner charac- 
teristics and optimal types of instruction. He used 
four treatments: (a) computer-assigned sequence
of instruction, instructor-evaluated product; (b)
computer-assigned sequence of instruction,
computer-evaluated product; (c) student-selected
sequence, instructor-evaluated product; and (d)
student-selected sequence, computer-evaluated
product. Separate analyses of variance were
conducted on the emergent data for four depen- 
dent variables: midterm examination, final 
product score, terminal or system time use, and
time to complete cognitive portion of task. The
results indicated that (a) there were no significant
effects on any of the dependent measures; (b)
both self-sequenced groups achieved superior
performance on three of four dependent measures;
(c) the computer-assigned sequence of instruction
was best in terms of cost; (d) those who performed
best on the dependent measures were enthusiastic
about the computer presentation; and (e) individual
differences were minimized in the computer-evaluated
group. In conclusion, specific
learner characteristics were related to success, and
the student-selected, computer-evaluated
approach was best in terms of costs.

Fishman, Keller, and Atkinson (1968) used CAI 
to present spelling drills to 29 fifth-grade students. 
Some words were presented via distributed 
practice, and other words were presented with 
massed practice. The results demonstrated that at 
the end of training the massed trials produced 
more correct responses, but 10 and 20 days later, 
the distributed practice group was superior 
(p<.025). 



In another study, Rawls and Rawls (1968) 
found no significant differences in achievement 
and retention between conventional lecture pres- 
entation and closed circuit TV. College students, 
though, regarded the TV instruction unfavorably 
and preferred classroom instruction. This was true 
even among those who achieved high grades or had 
previous TV courses. The students were observed 
looking at the TV set only 20 percent of the time, 
while they looked at the lecturer 42 percent of the 
time. 

Fidelity. Grimsley (1969a, 1969b) proposed to 
study the effects of variations in fidelity upon 
acquisition, transfer, and retention in group train- 
ing procedures. There were 12 trainees per condi- 
tion, trained in groups of four on the Nike- 
Hercules missile. They used a real (electric), a cold 
(non-electric), or an artist’s sketch of the control 
panel. The subjects were tested immediately after 
training, four weeks later, and six weeks later on 
the 92-step missile firing procedure. No differences
were found in training time, post-training performance,
performance after four and six weeks, or
retraining time (after six weeks). This study
suggests that a considerable saving of costs can be
achieved by using a low-fidelity device. Similar
results were found by Grimsley (1969a, 1969b) in
a study that was identical except that group train- 
ing procedures were not used. 

Reduced Training Time. Longo and Mayo 
(1967) wished to determine if the 19-week air- 
borne electronics training course could be 
decreased in time to 14 weeks. Two matched 
samples of trainees were used (total = 308). The 
results proved disappointing since students in the 
longer course performed better than students in 
the shorter course. 

Johnson and Salop (1968) observed that regular 
track avionics fundamentals training requires 16
weeks while accelerated track training needs only
10 weeks. The accelerated course differs from the
standard course only in speed and amount of
redundancy. In addition, only students of high
ability are assigned to the accelerated track. It was
found after training that accelerated students 
scored 2.6 points below students of the same 
ability on the single track program, but 5.9 points 
higher than all one track students, and 20.8 points 
higher than that required to graduate. The authors 
estimated that use of accelerated training in 
avionics fundamentals can save $750,000 a year. 

Valverde (1969) decided to apply a systems 
approach to electronics maintenance training.
First, behavior descriptions were derived from task
analysis of the job requirements, followed by the
construction of performance tests based on the
objectives. Then a 14-week experimental training
course was constructed for subjects with 
electronics aptitude scores ranging from the 60th 
to the 95th percentile. This group received only 
enough electronics theory to do the job. Another 
group with aptitude scores of 80 or better received 
the traditional 24-week course including 10 weeks 
of electronics principles. The experimental group 
was divided into two groups: 60th to 75th per- 
centile and 80th to 95th percentile. The results 
demonstrated that (a) the high-aptitude experi- 
mental group performed better on the perform- 
ance test than the medium-aptitude experimental 
group, which performed better than the tradi- 
tionally trained control group; (b) the control 
group scored better on special theory and job 
knowledge tests; and (c) the cost of the experi- 
mental program was less than the cost of the tradi- 
tional program. 

Mental Health. Kumpan (1965) was interested
in the effect of training on psychiatric aides in a
mental hospital. The trainees consisted of 48
experimental subjects taking a four-month training
program and 48 control subjects. There were two
experimental wards of 30 patients each with the
48 experimental aides rotating among them.
Kumpan found that the patients in the experimental
wards did, indeed, improve. Psychiatric
aides usually have the most contact with patients,
but they are ill-qualified to help them because
they do not understand the causes of mental
illness.

Cochran and Steiner (1966) used an experi- 
mental group of 58 attendants for the retarded. 
They were given the Southern Regional Education 
Board Test before and after training. Sixteen 
control attendants were also used to determine if 
testing itself can cause a gain in posttest scores 
without training. Indeed, the control subjects 
gained 5.18 points (p<.01), while the experi- 
mental subjects gained 26.8 points (p<.001). Also, 
younger subjects with the least tenure seemed to 
make the greatest gains. 

Poser (1966) performed an experiment to
answer the question of whether special academic
or intellectual knowledge is required to perform 
group therapy with schizophrenics. The three 
experimental conditions involved (a) 45 patients 
treated by psychiatrists and trained social workers, 
(b) 87 patients treated by students without any
training, and (c) 63 untreated controls. All 
patients, before and after therapy, were given 
several tests to differentiate psychotic from 
normal, including tapping speed, reaction time, 
digit symbol, color-word conflict, verbal fluency,
and the Verdun Association List. Analysis of
covariance was performed on the data. The results 
indicated that (a) four of six tests showed signifi- 
cant gains by the lay therapist group as compared 
with the untreated groups; (b) two of six tests 
showed significant gains as the result of therapy by 
the professional therapist; and (c) three of six tests 
showed significant gains by the lay therapists over 
the professional therapists. 

The conclusion from this experiment would 
seem to be that the use of lay therapists produced 
greater improvement than the professionally 
trained therapist. Of course, this involved only 
group therapy and not the traditional one-to-one 
situation in which a professional is most certainly 
needed. 

Leadership Training. Rittenhouse (1953)
compared two samples of enlisted men, one of 
which attended noncommissioned officer (NCO) 
leadership school. Both groups were compared on 
rank, assignment, and awards. The school group 
seemed to have a higher final rank and the non- 
school group had a greater gain in rank, but these 
differences were not statistically significant. The 
school graduate group had more infantry assign- 
ments (47.2 percent and 36.7 percent). Also, a 
greater proportion of the school graduate group 
received combat infantry badges. 

Hood, Showel, and Stewart (1967) contrasted 
three methods of NCO leadership training with a 
non-training group. The trained leaders demonstrated
(a) higher evaluations, (b) greater esprit de
corps among their subordinates, (c) better proficiency
test performance, (d) better preparation,
briefing, and control of their men, and (e) more 
frequent structuring and use of rewards and defini- 
tions. 

Barrett (1965) attempted to measure the 
impact of a 90-hour executive training program of 
the City of New York through comparison with a 
control group which did not undergo training 
(total N = 255). The results demonstrated no 
differences across groups in before- and after- 
performance ratings by peers and supervisors. The 
only measurable changes were increases in con- 
sideration and in initiating structure in the trainees 
and a decreased critical attitude toward subordi- 
nates. 

Armor Training. The Human Resources Research
Organization (Baker, Cook, Warnick, &
Robinson, 1964) developed and evaluated a
system for conducting tactical training of tank
platoon crews. The tank crews themselves were
trained on a miniature battlefield with
radio-controlled tanks and simulated terrain. The tank
commanders were trained on the Army Combat
Decisions Game using tank models on a terrain
board. A field performance test was then administered
to the experimentally trained crews and to a
group of matched controls. The crews receiving
experimental training obtained significantly higher
scores than the matched control crews.

Olson and Baerman (1955) wished to determine
if a brief course in gas conservation had any effects 
on fuel consumption in the M48 tank. The three 
experimental conditions were (a) control: rotated
among tanks in unit; (b) control: kept own tank;
and (c) experimental: received instruction in fuel
economy. These researchers found that the experimental
group used less fuel when considerable
stop-and-go driving was involved. 

Reading and Verbal Instruction. Seventy-two 
scientists and engineers were trained for reading 
using a book method, and 42 were trained using 
mechanical machines (Jones & Carran, 1965). 
Different forms of the Diagnostic Reading Test 
were given before and after training. All subjects 
were found to have gained significantly after train- 
ing, but in a followup 18 months later, the book 
approach was shown to be superior. In fact, 
performance of the machine trained group actually 
decreased after the time period, while performance 
of the book trained group continued to increase 
(p<.002). 

Kelley and Mech (1967) wished to ascertain if a 
reading laboratory course could produce an 
increase in grade point average among college 
students. Twenty-three experimental subjects were 
matched with 23 controls. After three semesters 
no significant differences in grade point average 
were found. The investigators then divided their 
experimental and control groups by academic 
major. They found that (a) among education
majors there was a statistically significant difference
after three semesters (p<.025); (b) there was
also a statistically significant difference among
science and mathematics students (p<.01); and (c)
there were no significant differences among social
studies and literature majors. Perhaps the education,
science, and mathematics majors had an
initially greater decrement in verbal ability, leaving 
a great deal more room for improvement. Also, 
education majors may have had a greater interest 
in reading improvement. 

Frase (1969) taught 48 undergraduates verbal
materials using two different methods of presenta- 
tion. One method used a horizontal display of 
associations while the other used a vertical tabular
display of associations. The results showed that
the horizontal method yielded superior learning,
yet the subjects preferred the vertical tabular 
display. 

Comparative Studies of 
Low-Aptitude Subjects 

Skill Acquisition. Van Matre and Steinemann
(1966) trained 26 low-aptitude men in an elec- 
tronics technician course in a shorter period of 
time and gave them skills more immediately useful 
on the job. This group was compared with 24 
conventionally trained personnel in a fleet follow- 
up using performance tests, ratings, interviews, and 
written tests. The results demonstrated that the 
performance of the experimental group was 
adequate and not significantly different from the
conventional group in proficiency. 

Van Matre and Harrigan (1970) compared the 
performance of 54 marginally qualified electrical 
technicians with 51 well-qualified electrical technicians
who underwent training. These groups
were compared after they were on the job in the 
fleet for 24 months. A rating scale and a struc- 
tured interview score were used as criteria. The 
conventionally trained men were rated as more 
capable in troubleshooting and use of test equip- 
ment, but were not generally rated differently 
from low-aptitude men. In fact, the lowest ratings 
obtained by low-aptitude men were average. 

Mayo (1969) administered an aviation struc- 
tural mechanic course to 30 Category IV per- 
sonnel, i.e., the lowest 30 percent on the Armed 
Forces Qualification Test (AFQT). The fleet 
performance of this group was then compared 
with that of personnel who scored above the 30th 
percentile. Among the low-aptitude men, performance
varied from highly satisfactory to unsatisfactory
with no way of predicting which men
would perform adequately. Low-aptitude men
were found to have lower ratings (p<.05) than the
other groups. Based on these results, Mayo 
suggested that Category IV personnel should not 
be used for this Naval rating unless there is a man- 
power shortage. It is noted, however, that the 
comparison group was given 25 percent more 
training and that ratings were used as criteria 
rather than performance tests. 

Hooprich (1968) wished to determine the 
appropriateness of commissaryman training for 
Category IV personnel. The results, based on two 
studies, demonstrated that (a) 31 of 35 Category
IV subjects successfully completed training,
regardless of their low reading ability, although
their grades were significantly lower than those of the
comparison group; (b) Category IV subjects
needed to devote more outside time to study, and 
they required more time from instructors to meet 
criterion; (c) the differences across groups were 
most evident on paper-and-pencil tests and least 
evident on actual performance tests; (d) AFQT 
scores failed to predict school performance; and 
(e) reading test scores were significantly correlated 
with some aspects of performance. 

Standlee and Saylor (1969) performed an 
equipment operator training study with Category 
IV subjects. The performance of six Category IV 
subjects was compared with 16 subjects who were 
not so classified. Then, the AFQT scores for this 
group and for commissaryman training were 
combined to determine if AFQT score predicted 
performance. It was found that (a) all Category IV 
subjects passed the course; (b) scores of the Category
IV subjects were lower, especially on written
tests as opposed to the more practical performance
tests; (c) AFQT scores were unrelated to achieve- 
ment; (d) mathematics was a source of trouble for 
Category IV personnel; and (e) Category IV men 
needed more individual attention and counselling. 

Fox, Taylor, and Caylor (1969) compared the
performance of low-aptitude men with higher aptitude
men on several training tasks: visual monitoring,
rifle assembly, missile preparation, phonetic
alphabet, map plotting, and combat plotting. 
Low-aptitude groups needed 2 to 4 times as much 
training time, 2 to 5 times more training trials, and 
2 to 6 times as much prompting to reach criterion. 
Middle-aptitude group performance was found to 
be more like that of the high-aptitude group than 
the low-aptitude group. The authors concluded 
that individual differences in aptitude must be 
recognized, and training programs must be 
designed to account for these differences. 

Grunzke, Guinn, and Stauffer (1970) evaluated 
the performance of 26,915 low-aptitude men who 
were taken into the Air Force even though they 
were below the minimum acceptable level. The 
findings demonstrated that the low-aptitude men, 
as compared with subjects with higher aptitude,
had (a) a smaller percentage completing basic
training, (b) more disciplinary problems, (c) more
unsuitable discharges, and (d) a lower percentage
attaining skill level. In addition, among
low-aptitude men, high school graduates and whites
performed better than high school non-graduates 
and Negroes. 

In another study, a manpower training program 
was surveyed by comparing 1,062 program graduates
with 444 program dropouts (Trooboff,
1968). The results showed that 84 percent of the
graduates received employment while only 67 
percent of the dropouts received employment. 
Also, the average earnings of graduates increased 
from $.98 to $1.76 (79 percent), while the average 
earnings of dropouts increased from $1.07 to 
$1.51 (29 percent). Even though several factors 
were left uncontrolled, the author concluded that 
the program was successful. 
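
The percentage gains quoted above can be checked with a line of arithmetic. The sketch below assumes that "increase" means the gain divided by the starting wage; the function name `percent_increase` is ours and does not come from the study. Under that assumption the graduates' figure comes out just under 80 percent, in line with the reported 79 percent, while the dropouts' figure comes out near 41 percent. The reported 29 percent corresponds more closely to the gain divided by the final wage; which base the original author used is not stated here.

```python
def percent_increase(before, after):
    """Relative increase over the starting value, in percent."""
    return (after - before) / before * 100.0


# Hourly earnings as quoted in the text (dollars).
graduates = percent_increase(0.98, 1.76)
dropouts = percent_increase(1.07, 1.51)

# Alternative base (an assumption): gain divided by the final wage.
dropouts_on_final_wage = (1.51 - 1.07) / 1.51 * 100.0

print(f"graduates: {graduates:.1f} percent")
print(f"dropouts: {dropouts:.1f} percent")
print(f"dropouts, final-wage base: {dropouts_on_final_wage:.1f} percent")
```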

Individualized Training. McFann (1969a,
1969b) found that the differences between high-
and low-aptitude men in basic combat training 
were greatest on cognitive tasks and that the 
difference was not as marked on motor skills and 
proficiency tests, with most low-aptitude men 
meeting standard. In the study, high-, middle-, and 
low-aptitude groups were selected and trained, 
using videotape, a one-to-one student to teacher 
ratio, feedback, reinforcement, and small incre- 
ments. In some tasks, low-aptitude men reached 
standard, but took 2 to 4 times longer, and in 
other cases they failed to master the material at 
all. McFann also found that aptitude interacts with 
method of instruction. The high-aptitude group 
was found to learn equally well with lecture or 
individualized training, while the low-aptitude
group learned well with individualized training, 
but not with lecture. 

J. Taylor (1970) found that both high- and 
low-aptitude personnel learn faster when given 
wire splice training via audiotape and slides as 
compared with a programmed book. For the 
high-aptitude personnel, the programmed book 
required 25 percent more training time; for the 
low-aptitude group, it took 50 percent more train- 
ing time. From these results, Taylor suggests that 
training be adapted to individual differences. 

Language Skills. Vineberg, Sticht, Taylor, and 
Caylor (1970) found that military training 
manuals were 6 to 8 grade levels above the reading 
level of Category IV personnel, and 4 to 6 grade 
levels above the reading level of higher aptitude 
subjects. Many of these individuals relied more 
heavily on asking and listening to others. In 
another study, Sticht (1969) found that among 
low-aptitude men learning by listening was more 
effective than learning by reading, although some 
did better by reading. 

Summary

This chapter contained reviews of several comparative
evaluation studies. Some of the studies
were concerned with comparative evaluation of 
new training methods while others were concerned 
with methods of training low-aptitude personnel. 



With regard to the training of low-aptitude men, 
more practical and individualized and less 
theoretical training seems superior to standard 
training procedures. 



VII. DISCUSSION 

There has been an increasing trend in the past 
decade in the use of factor analysis and other 
multivariate statistical techniques. Employment of 
these techniques has been made more feasible by 
the increased availability of high-speed computers. 
Many investigators, though, tend to use factor 
analysis as an end product or explanation rather
than as an aid in data analysis. Factor analytic
research can be misleading since the factors 
derived from the matrix reduction are directly 
dependent upon the variables making up the corre- 
lation matrix. This is a question of content 
validity. If the variable input is biased, then the 
results (factors) will be biased. In addition, most 
of the recent factor analytic literature has been so 
abstruse that it is difficult to understand the ideas 
presented, much less to implement them. 

There has not been enough attention to 
canonical correlation, Q-factor analysis, and multi- 
variate research design. No evaluative studies were 
found in which the first two of these methods 
were used, and too few studies using the latter 
were observed. Perhaps some of these sophisti- 
cated techniques are not appropriate to the data 
collected. In fact, a large portion of the data 
collected are not worthy of any analysis. 

A large portion of the authors of the research 
studies reported in this review are guilty of 
violating one or more of the following canons of 
statistical methodology: (a) use of too few 
subjects; (b) use of inappropriate statistical tech- 
niques; (c) failure to use control groups, or use of 
inadequate controls; (d) use of improper sampling 
procedures; and (e) use of inappropriate, con- 
taminated, or unreliable criteria. 

Other quantitative methods which are given 
much lip service, but which are little used in
practice except by their authors, are (a) sequential 
testing, (b) criterion-referenced testing, (c) confi- 
dence testing, (d) part correlation, (e) magnitude 
estimation, and (f) application of theory of signal 
detection. It behooves other investigators to try 
these techniques. Such methods can increase the 
sensitivity and generalizability of research findings. 

One method which others are beginning to use 
is Campbell and Fiske’s (1959) technique for
establishing convergent and discriminant validity.
Convergent validity exists if there is a high correlation
between tests purporting to measure the same
thing; and discriminant validity exists when tests
measuring different factors are independent. This
technique should prove very useful in the future 
for psychometricians involved in test construction 
and validation. 
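
The two definitions above reduce to a pair of correlations, which can be sketched as follows. This is a minimal illustration only, not the full multitrait-multimethod matrix that Campbell and Fiske describe. The scores, the test names, and the helper `pearson_r` are all invented for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for eight trainees (invented for illustration).
# Two measures intended to tap the same skill by different methods:
troubleshooting_written = [12, 15, 9, 20, 17, 11, 14, 18]
troubleshooting_rating = [13, 16, 10, 19, 18, 10, 15, 17]
# One measure intended to tap a different factor:
clerical_speed = [30, 22, 27, 25, 31, 24, 29, 23]

convergent_r = pearson_r(troubleshooting_written, troubleshooting_rating)
discriminant_r = pearson_r(troubleshooting_written, clerical_speed)

print(f"same trait, different method: r = {convergent_r:.2f}")
print(f"different trait: r = {discriminant_r:.2f}")
```

A high first value together with a second value near zero is the pattern the definitions call for.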

Another innovation which will come more into 
vogue is cost-effectiveness, or cost-benefit, analysis.
This criterion is useful, as for any other
ratio, only if there is an adequate data base for 
both the numerator and the denominator of the 
ratio. Thus, the technique demands more precise 
economics and performance evaluative data. 
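
The dependence of the ratio on both of its terms can be illustrated with a small sketch. It assumes cost in the numerator and a measured performance gain in the denominator; the text does not fix that orientation, and the function names and figures below are hypothetical.

```python
def cost_per_unit_gain(cost, gain):
    """Cost divided by performance gain; lower values are more favorable."""
    if gain <= 0:
        raise ValueError("gain must be positive for the ratio to be meaningful")
    return cost / gain


def ratio_bounds(cost_low, cost_high, gain_low, gain_high):
    """Lowest and highest ratio consistent with the stated uncertainty."""
    return (cost_per_unit_gain(cost_low, gain_high),
            cost_per_unit_gain(cost_high, gain_low))


# Hypothetical course: cost known to within about 10 percent,
# performance gain known only roughly (4 to 10 test points).
best, worst = ratio_bounds(9000.0, 11000.0, 4.0, 10.0)
print(f"cost per point gained lies between {best:.0f} and {worst:.0f}")
```

With these invented figures the imprecise denominator, not the cost estimate, accounts for most of the spread, which is the sense in which the technique demands better performance data.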

Although the moderator variable technique is
properly a subtopic under statistical methods, its 
emphasis in the recent literature demanded that it 
be given treatment in a separate chapter of this 
review. A test or measure can be a moderator 
variable when its use differentially determines the 
predictability of another test or measure. Almost 
any test score may be a potential moderator 
variable as are race, sex, personality, and other 
background factors. 
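
Differential predictability can be made concrete by computing the predictor-criterion correlation separately within each level of the suspected moderator. The sketch below does this for two subgroups; the group labels, scores, and the helper `pearson_r` are hypothetical and chosen only to show the pattern.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented data: (selection test scores, job criterion scores) per subgroup.
subgroups = {
    "group_a": ([10, 12, 14, 16, 18, 20], [51, 55, 62, 64, 71, 74]),
    "group_b": ([10, 12, 14, 16, 18, 20], [60, 72, 55, 66, 58, 69]),
}

# Validity of the selection test within each level of the moderator.
validity = {name: pearson_r(test, criterion)
            for name, (test, criterion) in subgroups.items()}

for name, r in validity.items():
    print(f"{name}: r = {r:.2f}")
```

When the two within-group correlations differ as sharply as they do here, the grouping variable is acting as a moderator of the test's predictive validity.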

Cognitive style seems to differ across deprived 
and non-deprived groups and must be accounted 
for and taken into consideration in order that the 
potential of the human resources in our society 
can be maximized. 

Several studies were surveyed which use race 
and aptitude as moderator variables. One impor- 
tant conclusion (Boehm, 1971) to be drawn from 
this research is that objective and performance 
oriented dependent measures are less likely to 
show differences across racial groups than the 
more subjective rating methods. Another conclusion
(McFann, 1969a, 1969b) is that high-aptitude
groups learn equally well with lecture or individ- 
ualized training, while low-aptitude groups learn 
well with individualized training but not with 
lecture. 

Individualized or programmed instruction is 
another major educational trend which has 
achieved prominence in the last five or ten years. 
Individualized or programmed instruction repres- 
ents an amalgam of the principles of learning 
theory with the idiosyncrasies of the individual.
Programmed instruction can be sequential, 
allowing the individual to proceed in very small 
steps through a fixed instructional sequence, or 
branched. Branching allows the individual’s 
progress to be governed by his own responses. 



Sequential testing has been used in individualized 
instruction in order to ascertain rapidly the level 
of knowledge possessed by the student. Also, 
criterion-referenced tests, rather than norm- 
referenced tests, have been employed, since the 
student must be able to perform each unit of 
instruction at a certain level of proficiency before 
advancing to the next unit of instruction. 
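
The combination of branching with a criterion-referenced advancement rule can be sketched as a small lookup table. The unit names, the 0.80 criterion, and the functions `next_unit` and `follow` are assumptions made for illustration; they are not taken from any of the programs reviewed.

```python
CRITERION = 0.80  # hypothetical proficiency level required to advance

# Each unit maps to (unit taken after a pass, unit taken after a failure).
# Unit names are invented; None marks the end of the course.
BRANCHES = {
    "unit_1": ("unit_2", "unit_1_review"),
    "unit_1_review": ("unit_2", "unit_1_review"),
    "unit_2": (None, "unit_2_review"),
    "unit_2_review": (None, "unit_2_review"),
}


def next_unit(current, score):
    """Branch on the student's own response: pass moves on, failure reviews."""
    on_pass, on_fail = BRANCHES[current]
    return on_pass if score >= CRITERION else on_fail


def follow(scores, start="unit_1"):
    """Trace the units visited for a series of scores.

    Returns the path taken and whether the course was completed.
    """
    path, unit = [], start
    for score in scores:
        if unit is None:
            break
        path.append(unit)
        unit = next_unit(unit, score)
    return path, unit is None


print(follow([0.90, 0.85]))
print(follow([0.60, 0.90, 0.70, 0.95]))
```

A purely sequential program corresponds to the case in which every student follows the first path; the second call shows a student routed through both review units before finishing.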

Computer assisted instruction (CAI) is the 
application of computers to programmed instruc- 
tion. CAI can be especially practical when a large 
number of short tests must be given to the trainee, 
and when instructor-student interaction is not 
considered crucial to learning. 

Another noted trend was an increased concern 
with cross-cultural training and evaluation. Here, 
the “cultural assimilator” (Fiedler, Mitchell, & 
Triandis, 1970; Worchel & Mitchell, 1970) seemed 
to possess some merit. In this method, critical 
incidents are obtained regarding circumstances in 
which the norms of behaviors across cultures are 
quite different. Questions are asked about the 
incident, and the multiple-choice answer format is 
employed. The responses of a target sample from 
the host culture are employed to provide the 
correct answer keying. 
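
The keying step can be sketched in a few lines. The example assumes that the keyed answer for each incident is the most frequent choice among host-culture respondents; the text says only that host responses supply the key, so the modal rule, the item names, and the functions `build_key` and `score` are assumptions.

```python
from collections import Counter


def build_key(host_responses):
    """Key each incident to the most common host-culture choice (assumed rule)."""
    return {item: Counter(choices).most_common(1)[0][0]
            for item, choices in host_responses.items()}


def score(trainee_answers, key):
    """Number of incidents on which the trainee matches the keyed answer."""
    return sum(1 for item, answer in trainee_answers.items()
               if key.get(item) == answer)


# Invented multiple-choice responses from five host-culture respondents.
host_responses = {
    "incident_1": ["B", "B", "A", "B", "B"],
    "incident_2": ["C", "D", "C", "C", "C"],
}

key = build_key(host_responses)
trainee_score = score({"incident_1": "B", "incident_2": "A"}, key)
print(key, trainee_score)
```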

Similarly, emphasis on increasing basic skills 
generally and reading skill specifically has achieved
import. Courses in reading instruction have 
produced gains in reading speed, retention of 
reading speed, and transfer. No single method of 
reading instruction seems to have demonstrated 
superiority to another. 

A method developed by Greer, Smith, and 
Hatfield (1967) has to some degree eliminated 
rater bias in helicopter checkpilots. After a task 
analysis, proficiency tests and instrument observa- 
tion were substituted for the checkpilot’s own 
evaluation method. This technique was able to (a) 
increase the reliability of evaluation, (b) identify 
specific student deficiencies, and (c) increase 
checkpilot consistency. 

Siegel and Schultz (1961) and Siegel, Schultz, 
and Federman (1961) constructed an evaluative 
technique using matrix concepts which was 
successfully applied to a military setting (Schultz 
& Siegel, 1962). These writers feel that training is 
good if the average trainee performs proficiently 
on important tasks. Training is poor if the average 
worker performs poorly on important tasks. This 
method identifies deficiencies in the training 
program which need emphasis and those parts of 
the training program which need deemphasis.
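
The logic of crossing task importance with trainee proficiency can be sketched as a simple classification. This is not the authors' matrix procedure, whose details are not given here. The rating scales, the cut points, and the rule that unimportant but well-performed tasks are candidates for deemphasis are all assumptions made for illustration.

```python
def classify_tasks(tasks, importance_cut=3.0, proficiency_cut=3.0):
    """Cross task importance with average trainee proficiency.

    tasks maps a task name to (importance, average proficiency), both on a
    hypothetical 1-to-5 scale. The cut points are illustrative assumptions.
    """
    result = {}
    for name, (importance, proficiency) in tasks.items():
        important = importance >= importance_cut
        proficient = proficiency >= proficiency_cut
        if important and not proficient:
            result[name] = "emphasize"      # training deficiency
        elif not important and proficient:
            result[name] = "deemphasize"    # assumed rule, see lead-in
        else:
            result[name] = "no change"
    return result


# Invented task ratings.
ratings = {
    "troubleshoot power supply": (4.5, 2.1),
    "complete supply forms": (1.5, 4.2),
    "align receiver": (4.0, 4.1),
}

recommendations = classify_tasks(ratings)
for task, action in recommendations.items():
    print(f"{task}: {action}")
```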



The comparative studies discussed in this review 
were concerned with relative comparisons between 
two or more methods of instruction or training. In 
most cases a new training method was compared 
with a standard method to determine if the latter 
should be modified or replaced. Some of the 
conclusions to be drawn from this research are 
presented. 

1. CAI is superior to standard instruction for
electronics technicians in terms of achievement
and speed (Hurlock, 1971).

2. If personnel shortages exist, job experi- 
enced Air Force instructors may be used in 
practical shop related courses, and 
instructors who are not job experienced
may be used in lecture courses (Askren & 
Valentine, 1970). 

3. Some of the newer Army marksmanship 
training methods are superior to the older, 
standard methods (McFann, Buchanan, 
Lyons, Ward, & Waits, 1958; Olmstead, 
1968). 

4. The benefits of simulator training are vari- 
able and seem to be dependent on a multi- 
plicity of factors. 

5. CAI, in the overall, seems to be a cost- 
effective training technique. 

6. Students indicate a preference for 
traditional lectures over TV instruction 
(Rawls & Rawls, 1968).

7. Variations in the fidelity of a trainer seem 
to produce no observable performance 
differences. 

8. Accelerated training is successful for high- 
aptitude students in avionics fundamentals 
training (Johnson & Salop, 1968). 

9. NCO leadership training resulted in im- 
proved leader behavior over a no-training 
group (Hood, Showel, & Stewart, 1967). 

10. Fuel conservation training can reduce fuel 
consumption in drivers of the M48 tank 
(Olson & Baerman, 1955). 

11. A programmed book reading instruction 
course produces greater long-term improve- 
ment than machine training (Jones & 
Carran, 1965). 

There has also been considerable recent concern 
with low-aptitude individuals who, generally, can 
perform many skilled tasks adequately when given
proper training. They tend to be slower learners
and retain knowledge best when taught by 
practical rather than highly verbal means. 

Finally, systematic approaches to evaluation and 
course development are beginning to receive some 
emphasis. They attempt to account for almost all
of the variables that can affect training and
student behavior. Most systems begin with a job
analysis in order to derive a list of behaviorally
oriented job requirements from which training
objectives can be formulated. Many writers
advocate a pre-training appraisal of the entering
students in order to direct them to the training
method which is most suited to their needs and
abilities. Criterion-referenced tests and other 
measures of student behavior are then constructed 
in order to reflect the training objectives. Finally, 
after training, the students and the training 
program are evaluated through various means.